This volume is an anthology of papers organized in two parts. Part One
covers concurrency control and consists of four papers. Part Two covers
recovery algorithms and consists of four papers.


The first paper presents an overview of concurrency control algorithms for
distributed database systems. It decomposes the concurrency control problem
into several subproblems, and describes the known solutions to the
subproblems. This paper extends Bernstein and Goodman's earlier survey of the
subject ("Concurrency Control in Distributed Database Systems," ACM Computing
Surveys 13,2 (June 1981)) by using an improved decomposition into
subproblems and by including more algorithms, notably certifiers and
multiversion algorithms.


In a multiversion concurrency control algorithm, each write operation on a
data item, x, produces a new "version" of x, leaving old versions of x
unchanged. The second paper presents a comprehensive description and logical
analysis of multiversion concurrency control algorithms. It extends
serializability theory to handle the multiversion semantics of "write." It
describes  multiversion  concurrency  control  algorithms  based  on  locking  and 
timestamping,  and  proves  them  correct  using  the  extended  theory. 


The third and fourth papers present mathematical analyses of two-phase
locking algorithms. The third paper describes a queueing theoretic approach
coupled with a random graph analysis of deadlocks. The fourth paper
describes a new control theoretic analysis that uses significantly weaker
assumptions than the standard queueing theoretic approach; thus, its
conclusions are quite general. These analyses only study a few of the many
available concurrency control algorithms, and therefore are not comprehensive.
They principally demonstrate the feasibility of these approaches, but
leave open a more complete comparative analysis of algorithms. Such a
comparative analysis using simulation techniques appears in the second
volume of this report.


A nonmathematical survey of recovery algorithms is presented in the fifth
paper. It describes methods for undoing a transaction after it aborts,
redoing committed transactions after a site fails, and extending any centralized
undo or redo algorithm to a distributed system.


A  mathematical  model  of  recovery  algorithms  is  presented  in  the  sixth 
paper.  Each  of  the  algorithms  described  in  the  previous  paper  is  proved 
correct  using  this  model. 


The seventh paper presents a new algorithm for site recovery in a distributed
database system. The problem is how to bring a site's database up to
date  after  it  has  recovered  from  failure.  The  solution  allows  different 
portions  of  the  database  to  be  brought  up  to  date  independently,  thereby 
avoiding  a  strongly  synchronized  bulk  transfer  of  the  entire  database. 


The last paper presents an optimal algorithm for undoing a set of
transactions to a consistent state. The problem is that when one transaction
is undone, any transaction that read its output must also be undone. Thus,
undos may cascade, requiring many other transactions to be undone. It is
shown  that  there  is  a  unique  best  set  of  transactions  to  undo,  and  a  fast 
algorithm  for  finding  this  set  is  described. 
Distributed Database Control and Allocation


Executive  Summary 


The  design  of  a  distributed  database  management  system  (DDBMS) 
involves  many  critical  design  decisions.  It  is  recognized  that  one 
of  the  most  important  of  these  design  decisions  is  the  choice  of 
the  concurrency  control  algorithm  to  be  used.  Many  concurrency 
control  algorithms  for  DDBMSs  have  been  proposed  (48  principal  ones 
were  identified  in  the  previous  effort*),  but  few  studies  have  been 
undertaken  to  rigorously  compare  their  performance  and  other 
characteristics.  One  possible  reason  for  this  is  that,  in  detail, 
these  algorithms  seem  very  different,  thus  making  comparison 
difficult. As a result, the DDBMS designer finds it difficult to
choose the concurrency control algorithm that is best given the design
parameters of the particular system under consideration.

Consequently, the Distributed Database (DDB) Control and Allocation
project  had  the  following  objectives: 

1.  Review  the  distributed  concurrency  control  research 
published  in  the  literature  and  incorporate  that  research 
into  the  taxonomy  of  the  distributed  database  concurrency 
control  algorithms.  Based  on  this  taxonomy,  develop  a 
new  framework  for  distributed  database  control. 

2.  Develop  new  distributed  database  concurrency  control 
algorithms  using  the  framework  developed  in  1. 

3.  Simulate  the  performance  of  the  distributed  database 
concurrency  control  algorithms  that  are  found  to  be 
dominant  in  the  previous  study. 

4.  Build  an  analytical  model  of  distributed  database 
concurrency control.

5. Survey the current studies of reliability and
recovery of distributed database systems and the analysis
of published algorithms.

6. Develop a framework for reliability and recovery of
distributed database systems.

7. Consolidate the results of the previous tasks into a
system designer's handbook.

*This effort was a follow-on to a previous effort conducted by the
same research team, which is documented in "Fundamental Algorithms
for Concurrency Control in Distributed Database Systems," Philip A.
Bernstein, et al., RADC TR-80-158, May 1980.

The  first  objective  was  achieved  by  means  of  the  framework 
discussed  in  Section  II  of  Volume  I.  The  framework  facilitates  the 
taxonomy  of  distributed  concurrency  control  algorithms  by 
identifying  the  essential  component  functions  of  distributed 
concurrency  control  mechanisms.  The  following  is  a  summary  of  that 
section.

A  distributed  database  system  (DDBS)  is  a  database  system 
that  provides  commands  to  read  and  write  data  that  is 
stored  at  multiple  sites  of  a  network.  If  users  access  a 
DDBS  concurrently,  they  may  interfere  with  each  other  by 
attempting  to  read  and/or  write  the  same  data. 
Concurrency  control  is  the  activity  of  preventing  such 
behavior.  The  following  simple  model  of  DDBS  structure 
and  behavior  highlights  those  aspects  of  a  DDBS  that  are 
important  for  understanding  concurrency  control. 

A  database  consists  of  a  set  of  data  items  (e.g.,  a 
simple  variable,  file,  record,  or  page,  etc.).  Users 
access  data  items  by  issuing  read  and  write  operations. 
Read(x)  returns  the  current  value  of  x,  while  write(x, 
new-value)  updates  the  current  value  of  x  to  new-value. 

Users  interact  with  the  DBMS  by  executing  programs  called 
transactions.  Each  site  of  a  DDBS  runs  one  or  more  of 
the  following  software  modules:  a  transaction  manager 
(TM),  a  data  manager  (DM),  or  a  scheduler.  Transactions 
talk to TMs, TMs talk to schedulers, schedulers talk
among themselves and to DMs, and DMs manage the data. Each
transaction  issues  a  'begin'  operation  to  its  TM  when  it 
starts  executing  and  an  'end'  when  it  is  finished.  The 
TM  forwards  each  read  and  write  to  a  scheduler.  The 
scheduler  controls  the  order  in  which  DMs  process  reads 
and  writes.  The  DM  executes  each  read  and  write  it 
receives  for  the  data  items  at  its  site. 

The  DDBS  modules  that  are  most  important  to  concurrency 
control  are  schedulers.  There  are  three  types  of 
schedulers:  two-phase  locking,  timestamp  ordering,  and 
serialization  graph  checking.  Two-phase  locking  requires 
setting a read-lock (or write-lock) for transaction Ti before
outputting ri[x] (or wi[x]). The lock must be held at
least until the operation is executed by the appropriate
DM. Different transactions cannot simultaneously hold
'conflicting' locks. Two locks conflict if they are on
the same data item and at least one is a write-lock. In
timestamp ordering (T/O) each transaction is assigned a
globally unique timestamp by its TM. All pairs of
conflicting operations are then output in timestamp
order. A serialization graph (SG) is a directed graph
whose nodes are transactions, such as T0, ..., Tn, and
whose edges are all Ti -> Tj, such that, for some x, (1)
Ti reads x before Tj writes x, or (2) Ti writes x before
Tj reads x, or (3) Ti writes x before Tj writes x. A
serialization  graph  checking  scheduler  works,  therefore, 
by  explicitly  building  a  serialization  graph  and  checking 
it  for  cycles. 
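
The conflict rule and the serialization graph test described above can be illustrated with a minimal sketch. It assumes a history is given as a list of (transaction, operation, data item) tuples in execution order; the function names and this representation are introduced here for illustration only and do not come from the report.

```python
from collections import defaultdict


def conflicts(op_a, op_b):
    """Two operations conflict if they are on the same data item, belong to
    different transactions, and at least one of them is a write."""
    (txn_a, kind_a, item_a), (txn_b, kind_b, item_b) = op_a, op_b
    return item_a == item_b and txn_a != txn_b and "w" in (kind_a, kind_b)


def serialization_graph(history):
    """Return edges Ti -> Tj for every conflicting pair in which the
    operation of Ti precedes the operation of Tj in the history."""
    edges = defaultdict(set)
    for i, earlier in enumerate(history):
        for later in history[i + 1:]:
            if conflicts(earlier, later):
                edges[earlier[0]].add(later[0])
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the directed graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)  # every node starts WHITE (unvisited)

    def visit(node):
        colour[node] = GREY
        for nxt in edges.get(node, ()):
            if colour[nxt] == GREY:
                return True  # back edge: a cycle exists
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(edges))


# T1 reads x before T2 writes x, and T2 reads y before T1 writes y.
interleaved = [("T1", "r", "x"), ("T2", "w", "x"),
               ("T2", "r", "y"), ("T1", "w", "y")]
serial = [("T1", "r", "x"), ("T1", "w", "y"),
          ("T2", "w", "x"), ("T2", "r", "y")]
```

In this sketch the interleaved history yields a cyclic graph and the serial one does not; a serialization graph checking scheduler would therefore refuse to produce the first.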

Each  of  these  schedulers  can  be  adapted  to  work  as  a 
certifier  (the  term  'certifier'  refers  to  a  scheduling 
philosophy  rather  than  a  specific  scheduling  rule).  A 
certifier makes its decisions on a per-transaction basis.
That is, when it receives an operation, it internally
stores information about the operation and outputs it as
soon as all earlier conflicting operations have been
acknowledged. When a transaction ends, its TM sends an
'end' operation. At this point, the certifier checks its
stored information to see whether the transaction
executed serializably. If it did, the certifier
certifies  the  transaction,  allowing  it  to  terminate; 
otherwise,  it  aborts  the  transaction. 

In  addition  to  the  type  of  scheduler,  the  location  of  the 
scheduler  and  how  replicated  data  is  handled  must  be 
taken  into  consideration  for  distributed  database 
concurrency  control.  Instead  of  one  scheduler  for  the 
whole  system,  there  is  one  scheduler  per  DM.  The 
scheduler  normally  runs  at  the  same  site  as  the  DM  and 
schedules  all  operations  that  the  DM  executes.  There  are 
three approaches to handling data replication: 'do
nothing', 'primary copy', and 'voting'. In the 'do
nothing' approach, each write is translated into writes on
all copies of the data item, each read reads from the latest
preceding transaction that wrote into any copy, and a
serializable scheduler orders these operations. In the 'primary copy'
approach,  some  copy  of  each  data  item  is  designated  the 
primary  copy,  such  that  each  TM  translates  ri[x]  into 
ri[xj]  for  some  copy  xj,  but  translates  wi[x]  into  a 
single  write  on  the  primary  copy  and  the  primary  copy's 
scheduler  issues  writes  on  the  other  copies  of  x.  In  the 
'voting'  approach,  writes  are  issued  to  all  copies  of 
each  data  item  and  when  the  scheduler  is  ready  to  output 
wi[xj],  it  sends  a  vote  of  'yes'  to  the  vote  collector 
for  x.  When  the  vote  collector  receives  'yes'  votes  from 
a  majority  of  schedulers,  it  tells  all  schedulers  to 
output  their  writes. 
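
The majority rule of the 'voting' approach can be sketched as follows. The class name VoteCollector and its methods are hypothetical, and message passing between schedulers and the collector is replaced by direct method calls; this is an illustration under those assumptions, not the report's protocol.

```python
class VoteCollector:
    """Collects 'yes' votes from schedulers for one pending write on x."""

    def __init__(self, scheduler_ids):
        self.schedulers = set(scheduler_ids)
        self.yes_votes = set()

    def vote_yes(self, scheduler_id):
        """Record a 'yes' vote; return True once a majority has voted."""
        if scheduler_id not in self.schedulers:
            raise ValueError(f"unknown scheduler: {scheduler_id}")
        self.yes_votes.add(scheduler_id)  # a repeated vote counts only once
        return self.has_majority()

    def has_majority(self):
        """True when more than half of the schedulers have voted 'yes'."""
        return len(self.yes_votes) > len(self.schedulers) / 2
```

Once has_majority() returns True, the collector would tell all schedulers to output their writes; that notification step is omitted here.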

The  second  objective  of  developing  new  algorithms  using  this 
framework  is  achieved  by  the  algorithms  described  in  Section  III  of 
Volume  I.  That  section  describes  the  work  in  extending  concurrency 
control theory to multiversion databases. In a multiversion
database  each  write  operation  on  a  data  item,  x,  produces  a  new 
"version"  of  x,  leaving  old  versions  of  x  unchanged.  When 
transactions  issue  operations  that  specify  data  items,  the  system 
must  translate  these  into  operations  that  specify  versions.  In  a 
single  version  database,  concurrency  control  correctness  depends  on 
the  order  in  which  reads  and  writes  are  processed.  In  a 
multiversion database, correctness depends on translation as well
as on order. The main idea is one-copy serializability: an execution of
transactions  in  a  multiversion  database  is  one-copy  serializable 
(1-SR)  if  it  is  equivalent  to  a  serial  execution  of  the  same 
transactions  in  a  single  version  database.  A  multiversion 
concurrency  control  algorithm  is  correct  if  all  of  its  executions 
are  1-SR.  Effective  necessary  and  sufficient  conditions  for  an 
execution  to  be  1-SR  were  derived  which  use  the  concept  of  version 
order.  A  graph  structure,  multiversion  serialization  graphs 
(MVSGs),  that  helps  check  these  conditions  is  given.  Once  a 
version  order  is  fixed,  an  execution  is  1-SR  if  and  only  if  its 
MVSG  is  acyclic.  MVSGs  are  analogous  to  the  serialization  graphs 
widely  used  in  single  version  concurrency  control  theory.  The 
theory  was  applied  to  three  multiversion  concurrency  control 
algorithms.  One  algorithm  uses  timestamps,  one  uses  locking,  and 
one  combines  locking  with  timestamps. 
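
The translation of data item operations into version operations can be illustrated with a small sketch. It assumes the usual multiversion timestamp rule, in which a read with timestamp ts is translated into a read of the version with the largest write timestamp not exceeding ts; this summary does not spell that rule out, and the class below is hypothetical. Rejection of writes that arrive too late, which a complete timestamp algorithm needs, is deliberately left out.

```python
import bisect


class MultiversionItem:
    """Versions of one data item x, kept sorted by writer timestamp."""

    def __init__(self, initial_value):
        self._stamps = [0]            # timestamp 0 holds the initial version
        self._values = [initial_value]

    def write(self, ts, value):
        """Create a new version of x; older versions are left unchanged."""
        pos = bisect.bisect_left(self._stamps, ts)
        if pos < len(self._stamps) and self._stamps[pos] == ts:
            raise ValueError("timestamps are assumed to be unique")
        self._stamps.insert(pos, ts)
        self._values.insert(pos, value)

    def read(self, ts):
        """Read the version with the largest write timestamp <= ts."""
        pos = bisect.bisect_right(self._stamps, ts) - 1
        if pos < 0:
            raise KeyError("no version is visible at this timestamp")
        return self._values[pos]


x = MultiversionItem("v0")
x.write(10, "v10")
x.write(5, "v5")   # an out-of-order write still leaves other versions intact
```

Here a reader with timestamp 7 is directed to the version written at timestamp 5, even though a newer version exists.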

The  third  objective  of  simulating  and  evaluating  the  performance  of 
distributed  database  concurrency  control  algorithms  is  documented 
in  Sections  II  through  VI  of  Volume  II.  Section  II  presents  a 
study  that  analyzed  the  relationship  between  the  performance  of  the 
two-phase  locking  algorithm  and  the  following  system  parameters: 
access  distribution  of  the  database,  data  granularity,  transaction 
size,  and  multiprogramming  level.  In  a  distributed  database 
system,  communication  delay  is  also  a  major  factor  affecting  the 
performance  of  a  concurrency  control  algorithm.  Section  III 
documents  an  analysis  of  the  relationship  between  the  performance 
of  the  two-phase  locking  algorithm  and  communication  delay. 

Another  important  factor  that  affects  performance  is  the  number  of 
read-only  transactions  relative  to  the  number  of  write 
transactions,  i.e.,  the  ratio  of  read-only  to  write  transactions. 
Section  IV  documents  an  analysis  of  the  relationship  between  the 
performance of the two-phase locking algorithm and that ratio.

Section  V  extends  the  analysis  to  algorithms  based  on  timestamps  by 
presenting  a  comparison  of  the  performance  of  three  distributed 
concurrency  control  algorithms  —  the  basic  timestamp,  multiple 
version  timestamp,  and  two-phase  locking.  Section  VI  documents  the 
analysis  of  the  two-phase  locking  algorithm  in  more  detail  and  the 
refinement  of  the  algorithm  into  nine  algorithms.  In  addition,  the 
previous  two  timestamp  algorithms  were  reevaluated  in  more  detail 
and  a  new  timestamp-based  algorithm,  the  dynamic  timestamp 
algorithm,  was  analyzed.  The  performance  of  the  twelve  algorithms 
was  then  compared.  The  results  of  these  simulation  studies  could 
form  the  basis  for  designing  a  distributed  database  designer's  aid. 
The  designer's  aid  would  help  the  system  designer  to  design 
distributed  transactions,  partition  the  database  into  fragments, 
replicate  and  distribute  the  fragments,  and  choose  the  concurrency 
control  algorithm  that  performs  best  in  his  system  environment. 

The  fourth  objective  of  analytical  modeling  of  the  distributed 
concurrency  control  algorithms  was  achieved  through  the  analytical 
models  described  in  Sections  IV  and  V  of  Volume  I.  Section  IV 
describes  using  a  queueing  theoretic  approach  coupled  with  a  random 
graph  analysis  of  deadlocks.  An  analytical  model  was  developed  for 
the general two-phase locking algorithm, which can be used to
estimate  the  steady  state  rates  of  conflicts  and  deadlocks  in  the 
system.  Dynamic  two-phase  locking  was  analyzed,  proceeding  from  a 
simple  deadlock  prevention  two-phase  locking  algorithm  (which  gives 
worst  case  bounds  on  the  performance  of  any  deadlock  preventing 
two-phase  locking  algorithm)  to  the  general  case  of  two-phase 
locking  where  deadlocks  are  allowed.  Under  reasonable  assumptions 
for  transaction  behavior,  it  was  found  that  the  rate  of  deadlocks 
is  proportional  to  the  average  number  of  transactions  in  the  system 
and  the  rate  of  conflicts  is  proportional  to  the  mean  of  the 
product  of  the  number  of  free  transactions  in  the  system  multiplied 
by  the  total  number  of  transactions  in  the  system. 
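
The two proportionalities stated above can be restated in symbols. The notation is introduced here for illustration only: N is assumed to denote the total number of transactions in the system and N_f the number of free transactions; the constants of proportionality are not given in this summary.

```latex
\lambda_{\mathrm{deadlock}} \propto \mathbb{E}[N],
\qquad
\lambda_{\mathrm{conflict}} \propto \mathbb{E}[N_f \, N]
```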

Section V describes a new control theoretic analysis that
uses significantly weaker assumptions than the standard queueing
theoretic approach; thus, its conclusions are quite general. The
model presented is easy to understand, costs little to solve
computationally, and captures the essential features of the system
it models. The model has two parts: a flow diagram and a set of
equations  describing  the  behavior  of  the  system.  The  equations  are 
derived  using  the  steady  state  average  values  of  the  variables. 
The  underlying  idea  is  to  characterize  the  system  in  terms  of  those 
average  values,  instead  of  detailed  dynamics  involving 
instantaneous  values  of  each  variable. 

These  analyses  only  studied  a  few  of  the  many  available  concurrency 
control  algorithms,  and,  therefore,  are  not  comprehensive.  They 
principally  demonstrated  the  feasibility  of  these  approaches. 

The  survey/study  of  reliability  and  recovery  of  distributed 
database  systems  that  achieved  the  fifth  objective  is  documented  in 
Sections  VI  through  IX  of  Volume  I.  The  following  paragraphs 
contain  a  brief  summary  of  the  current  taxonomy  of  recovery 
algorithms  as  well  as  a  description  of  Sections  VI-IX. 

The  recovery  algorithm  of  a  DBS  avoids  incorrect  states 
caused  by  transaction  failures  and  system  failures  by 
ensuring  that  the  database  only  includes  updates  that  are 
produced  by  transactions  that  execute  to  completion.  A 
centralized  DBS  is  modeled  as  storage,  a  scheduler,  and  a 
recovery  system. 

The  storage  component  consists  of  buffer  storage  and 
stable  storage.  Both  are  divided  into  physical  pages  of 
equal  and  fixed  size.  Buffer  storage  models  main  memory 
and  stable  storage  models  disk  memory.  The  DB  consists 
of  a  set  of  logical  pages.  A  transaction  is  a  program 
that  can  read  from  or  write  into  the  DB  and  can  issue 
four  types  of  commands:  read,  write,  commit,  and  abort. 

The  scheduler  controls  the  order  in  which  these  commands 
are  passed  to  the  recovery  system  and  guarantees  that  the 
execution  is  recoverable.  The  recovery  system  processes 
the  read,  write,  commit,  and  abort  commands  it  receives 
from  the  scheduler  and  handles  system  failures. 


Recovery algorithms often store copies of pages that were
recently written on an audit trail (sometimes called a
journal or log). For each write processed by the
algorithm, the audit trail may contain the identifier of
the transaction that performed the write, a copy of the
newly written page (called the after-image), and a copy of
the physical page in the stable database that was
overwritten by the write (called the before-image).
Different  algorithms  vary  considerably  in  the  information 
they  keep  on  the  audit  trail  and  in  how  they  structure 
that  information.  Recovery  algorithms  also  differ  in  the 
time  at  which  they  write  pages  into  the  stable  database. 
They may perform such writes before, concurrently with,
or after the atomic instruction that commits the
transaction that last wrote those pages. If a page that
is written by an active transaction is written into the
stable database before the transaction commits and the
transaction aborts due to a system or transaction
failure, the recovery algorithm must undo the write by
restoring the previous copy (before-image) of the page.
If  a  page  that  is  written  by  an  active  transaction  is  not 
written  into  the  stable  database  before  the  transaction 
commits  and  a  system  failure  occurs  after  the  transaction 
commits,  but  before  the  page  is  written  into  the  stable 
database,  the  recovery  algorithm  must  redo  the  write  by 
moving  the  page  to  the  stable  database. 
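
The undo and redo rules above can be sketched for an audit trail that records both images of every write. The record layout, the function name, and the use of a dictionary for the stable database are assumptions made for illustration; the sketch also assumes that no transaction overwrites a page last written by an uncommitted transaction.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    txn: str              # identifier of the transaction that wrote the page
    page: str             # page identifier
    before_image: object  # copy of the page that the write overwrote
    after_image: object   # copy of the newly written page


def recover(stable_db, audit_trail, committed):
    """Bring stable_db to a state containing only committed updates."""
    # Undo: scan backwards and restore before-images of uncommitted writes.
    for rec in reversed(audit_trail):
        if rec.txn not in committed:
            stable_db[rec.page] = rec.before_image
    # Redo: scan forwards and reinstall after-images of committed writes.
    for rec in audit_trail:
        if rec.txn in committed:
            stable_db[rec.page] = rec.after_image
    return stable_db


# T1 wrote page p to the stable database but never committed;
# T2 committed, but its write to page q had not reached the stable database.
db = {"p": "p_uncommitted", "q": "q_old"}
trail = [LogRecord("T1", "p", "p_old", "p_uncommitted"),
         LogRecord("T2", "q", "q_old", "q_new")]
recovered = recover(db, trail, committed={"T2"})
```

An algorithm that never writes uncommitted pages to the stable database could drop the first loop, and one that always writes pages before commit could drop the second.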

Recovery  algorithms  can  be  categorized  based  on  the 
timing  of  updates  to  the  stable  database.  Consequently, 
there  are  four  basic  types  of  recovery  algorithms:  ones 
that  require  undo  but  not  redo,  redo  but  not  undo,  both 
undo  and  redo,  and  neither  undo  nor  redo. 


As  previously  stated,  a  DDBS  is  modeled  by  a  set  of 
transaction managers (TMs), data managers (DMs), and
schedulers.  A  DM  is  a  centralized  DBS  as  defined  above. 
It  processes  reads  and  writes  on  pages  stored  at  its 
site.  It  also  processes  commits  and  aborts,  which 
permanently  install  or  undo  the  writes  of  a  transaction 
at  the  DM  site.  A  TM  interfaces  transactions  and  DMs. 
Unfortunately,  TMs  and  DMs  may  fail  at  unpredictable 
times.  Each  TM  must  process  commands  so  that  failures  of 
other  TMs  and/or  DMs  never  cause  it  to  produce  incorrect 
results.  Consequently,  each  TM  keeps  an  active 
transaction  list,  a  commit  list,  and  an  abort  list  in 
stable  storage  for  reference  against  TM  failures. 

To  avoid  inconsistencies  caused  by  TM  failures,  there  are 
two  basic  commit  algorithms:  two-phase  commit  and 
three-phase  commit.  In  two-phase  commit  a  TM  does  not 
send  'commit(i)'  to  any  DM  until  every  DM  in  'active(i)' 
has  transaction  T(i)'s  after-images  on  stable  storage. 
In  three-phase  commit  each  TM  has  one  or  more  backup  TMs, 
such  that  if  a  TM  fails,  the  backup  can  take  over  its 
functions.  In  particular,  a  three-phase  commit  is  a 
two-phase  commit  with  an  added  step:  the  TM  sends  a 
'precommit(i)' to each backup and waits for all backups
to  acknowledge  before  it  sends  'commit(i)'  to  each  DM  in 
'active(i)'.  Aborts  are  handled  in  a  similar  way. 
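
The commit rules above can be sketched from the TM's side. Everything named below (DataManager, commit_transaction, backup_logs) is hypothetical; messages are replaced by direct method calls, and a backup is modeled as a plain list whose append stands in for an acknowledged 'precommit(i)'.

```python
class DataManager:
    """Hypothetical stand-in for a DM; stable storage is a dictionary."""

    def __init__(self, name, can_write_stable=True):
        self.name = name
        self.can_write_stable = can_write_stable
        self.stable_after_images = {}
        self.committed = set()
        self.aborted = set()

    def store_after_images(self, txn, pages):
        """Put the after-images on stable storage and acknowledge."""
        if not self.can_write_stable:
            return False
        self.stable_after_images[txn] = dict(pages)
        return True

    def commit(self, txn):
        self.committed.add(txn)

    def abort(self, txn):
        self.stable_after_images.pop(txn, None)
        self.aborted.add(txn)


def commit_transaction(txn, active, writes, backup_logs=()):
    """Send commit only after every DM in 'active' holds the after-images.

    With a non-empty backup_logs, the extra precommit step of the
    three-phase variant is performed before any commit is sent."""
    acks = [dm.store_after_images(txn, writes.get(dm.name, {}))
            for dm in active]
    if not all(acks):
        for dm in active:
            dm.abort(txn)
        return "aborted"
    for log in backup_logs:
        log.append(("precommit", txn))
    for dm in active:
        dm.commit(txn)
    return "committed"
```

The sketch does not model TM failure or backup takeover, which is the situation the precommit step is meant to address.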


To  avoid  delay  caused  by  the  failure  of  a  DM,  the  DBS  can 
replicate  data,  that  is,  store  parts  of  the  database  at 
more  than  one  DM  site.  If  one  copy  is  unavailable  due  to 
a DM failure, other copies can be used instead.
Replication  of  data,  however,  introduces  another 
dimension  to  the  consistency  problem  that  involves  the 
concurrency  control  algorithms  described  elsewhere  in 
this  report. 

Section  VI  of  Volume  I  presents  a  non-mathematical  survey 
of  recovery  algorithms  and  describes  methods  for  undoing 
a  transaction  after  it  aborts,  redoing  committed 
transactions  after  a  site  fails,  and  extending  any 
centralized  undo  or  redo  algorithm  to  a  distributed 
system. 

Section  VII  presents  a  mathematical  model  of  recovery 
algorithms  (which  is  described  below)  and  the  correctness 
proof  using  this  model  for  each  of  the  algorithms 
described  in  Section  VI. 

Section  VIII  describes  a  new  algorithm  for  site  recovery 
in  a  distributed  database  system.  The  problem  is  how  to 
bring  a  site's  database  up  to  date  after  it  has  recovered 
from failure. The solution presented allows different
portions of the database to be brought up to date
independently, thereby avoiding a strongly synchronized
bulk transfer of the entire database to the recovered
site.

Section  IX  describes  an  optimal  algorithm  for  undoing  a 
set  of  transactions  to  a  consistent  state.  The  problem 
is  that  when  one  transaction  is  undone,  any  transaction 
that read its output must also be undone. Thus, undos
may cascade, requiring many other transactions to be
undone.  It  is  shown  that  there  is  a  unique  best  set  of 
transactions  to  undo,  and  a  fast  algorithm  for  finding 
that  set  is  described. 
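
The cascading effect can be illustrated by computing the closure of an initial undo set under the reads-from relation. This is a plain breadth-first search written for illustration; it is not claimed to be the algorithm of Section IX, which this summary does not describe, and the names used are hypothetical.

```python
from collections import deque


def transactions_to_undo(initial, reads_from):
    """Return the initial transactions plus every transaction that read,
    directly or transitively, the output of a transaction being undone.

    reads_from maps each transaction to the set of transactions whose
    output it read."""
    # Invert the relation: writer -> transactions that read its output.
    readers = {}
    for reader, writers in reads_from.items():
        for writer in writers:
            readers.setdefault(writer, set()).add(reader)

    undo = set(initial)
    queue = deque(initial)
    while queue:
        txn = queue.popleft()
        for reader in readers.get(txn, ()):
            if reader not in undo:
                undo.add(reader)
                queue.append(reader)
    return undo


# T2 read from T1, T3 read from T2, and T4 read only from T5.
reads = {"T2": {"T1"}, "T3": {"T2"}, "T4": {"T5"}}
```

Undoing T1 in this example forces T2 and T3 to be undone as well, while T4 and T5 are unaffected.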

Because  the  subject  of  reliability  and  recovery  of  DDBSs 
is  relatively  unexplored,  only  a  few  algorithms  were 
reported.  Further  research  is  needed  to  discover  new 
algorithms.

A  framework  for  the  reliability  and  recovery  of  a  distributed 
database  system  achieving  the  sixth  objective  is  described  in 
Section  VII  of  Volume  I.  The  framework  consists  of  an  operational, 
state-based  model  for  studying  reliability  of  DBSs,  i.e.,  the 
system  is  described  at  any  point  in  time  by  a  "system  state". 
Reliability-related  properties  of  the  system  (e.g.,  "resiliency") 
can  be  expressed  as  predicates  on  the  system  state.  Transaction 
processing  algorithms  can  be  described  as  state  transition 
functions,  mapping  the  current  system  state  to  the  next.  Finally, 
correctness  and  other  reliability  properties  of  algorithms  can  be 
proved  formally  by  examining  the  system  state  sequences  that  can  be 
generated  by  the  algorithms  in  question. 
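
The state-based style of reasoning described above can be sketched in a few lines. The state fields, the particular reading of "resiliency" as a predicate, and all names below are assumptions chosen for illustration; the model in Section VII is richer than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemState:
    committed: frozenset = frozenset()  # transactions that have committed
    stable: frozenset = frozenset()     # transactions whose updates are on stable storage


def resilient(state):
    """Predicate on the state: every committed update is on stable storage."""
    return state.committed <= state.stable


def careful_commit(state, txn):
    """Transition: install txn's updates on stable storage, then commit."""
    return SystemState(state.committed | {txn}, state.stable | {txn})


def careless_commit(state, txn):
    """Transition: commit without writing to stable storage."""
    return SystemState(state.committed | {txn}, state.stable)


def holds_along(predicate, initial, steps):
    """Check that the predicate holds in every state of the sequence
    generated by applying the transition functions in order."""
    state = initial
    if not predicate(state):
        return False
    for step in steps:
        state = step(state)
        if not predicate(state):
            return False
    return True
```

Checking one generated sequence is of course weaker than a formal proof over all sequences; the sketch only shows how properties and algorithms fit the same state vocabulary.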


This  framework  captures  the  essential  components  of  existing 
reliability  and  recovery  algorithms.  But,  because  research  on  this 
subject  is  in  its  primitive  stages,  more  research  is  needed  to 
refine  the  framework  and  to  use  it  to  develop  more  efficient 
algorithms.  Moreover,  the  refined  framework  could  become  a  basis 
for  the  standardization  of  distributed  reliability  and  recovery 
architectures.

Finally,  all  these  results  were  summarized  in  a  separate 
Distributed Database System Designer's Handbook (objective seven),
which  is  included  as  Volume  III.  This  handbook  can  help  the 
designer  to  select  a  distributed  concurrency  control  algorithm  that 
performs  best  in  his  system  environment.  The  following  is  a 
summary  of  those  results  and  recommendations. 

The  twelve  concurrency  control  algorithms  identified  in 
the  previous  effort  as  being  dominant  methods  were 
selected  for  further  study.  These  algorithms  are  called 

(1)  primary  site  and  primary  site  two-phase  locking, 

(2)  primary  copy  and  primary  copy  two-phase  locking, 

(3)  basic  and  basic  two-phase  locking, 

(4)  basic  and  primary  copy  two-phase  locking, 

(5)  basic  and  primary  site  two-phase  locking, 

(6)  DDM  multiple  version  and  optimistic  two-phase 
locking  (DDM), 

(7)  basic  and  optimistic  two-phase  locking, 

(8)  majority  consensus  timestamp, 

(9)  wait-die  two-phase  locking, 

(10)  basic  timestamp, 

(11)  multiple  version  timestamp,  and 

(12)  dynamic  timestamp. 

Five of the twelve were found to perform best in various
system environments: basic timestamp, multiple version
timestamp, DDM, basic-optimistic two-phase locking, and
basic-primary two-phase locking.

When  most  transactions  are  short,  algorithms  that  abort 
conflicting  transactions  (such  as  basic  timestamp  and 
multiple  version  timestamp)  perform  better  than 
algorithms  that  block  conflicting  transactions  (such  as 
basic-primary).  In  this  environment,  transactions 
conflict  rarely;  and  when  they  do  conflict,  the  blocking 
transactions  tend  to  be  longer  than  the  average 
transaction  size  and  blocking  delay.  If  a  two-phase 
locking  algorithm  must  be  used,  algorithms  that  delay 
lock  conflict  checking  (such  as  the  DDM  and  the 
basic-optimistic)  perform  better  than  those  that  expedite 
lock  conflict  checking  (such  as  basic-primary).  Unless 
the  communication  bandwidth  is  very  high,  communication 
delay  can  devastate  system  performance;  thus,  the 
designer  should  reduce  communication  delay  by  locally 
controlling  and  accessing  data  as  much  as  possible. 


However,  no  matter  which  concurrency  control  algorithm 
the designer uses, a system that has long transactions
always performs worse than a system that has short
transactions. The designer should design transactions to
access as much data in parallel as possible, and to break
long transactions into shorter transactions.

The  performance  study  showed  that  no  one  algorithm 
performs  best  in  all  system  and  application  environments. 

If  the  system  environment  is  stable,  the  database 
designer  can  select  one  algorithm  that  performs  best  in 
the  environment.  If  the  system  environment  is  not 
stable,  the  database  designer  can  assign  different 
weights  to  different  environments  according  to  how  often 
the  system  stays  in  each  environment.  The  database 
designer  then  selects  the  algorithm  that  has  the  best 
weighted  average  performance. 
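
The weighted selection described above can be sketched as follows. The function name, the convention that a higher score is better, and the placeholder algorithm names and numbers in the usage lines are all hypothetical; they are not results from this study.

```python
def best_weighted_algorithm(performance, weights):
    """Pick the algorithm with the best weighted average performance.

    performance: {algorithm: {environment: score}}, higher score is better.
    weights: {environment: share of time the system spends in it}."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    def weighted_average(algorithm):
        scores = performance[algorithm]
        return sum(weights[env] * scores[env] for env in weights) / total

    return max(performance, key=weighted_average)


# Placeholder scores for two unnamed algorithms in two environments.
scores = {"A": {"short": 10, "long": 2},
          "B": {"short": 6, "long": 7}}
mostly_short = {"short": 0.8, "long": 0.2}
mostly_long = {"short": 0.3, "long": 0.7}
```

With these placeholder numbers, algorithm A is selected when short transactions dominate and algorithm B when long ones do.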

From  the  results,  it  can  be  concluded  that  the  best 
algorithm  would  be  one  that  could  be  adjusted  by  the 
system  administrator,  according  to  the  environment,  to 
use  transaction  abortion  and  delay  lock  conflict 
detection  whenever  transactions  are  short,  and  to  use 
transaction  blocking  and  detect  lock  conflicts  as  soon  as 
possible  whenever  transactions  are  long.  The  adjustable 
algorithm  would  also  alternate,  depending  on  the  load  on 
the  communication  channel,  between  algorithms  that  have 
more  localized  control  and  algorithms  that  have  more 
distributed  control. 

In  summary,  all  of  the  objectives  were  satisfactorily  met.  The 
next  step  would  be  to  refine  the  taxonomy  for  reliability  and 
recovery  algorithms  and  conduct  performance  evaluations  for 
existing  and  new  algorithms,  as  was  done  in  this  effort  for 
concurrency  control.  A  successive  task  would  then  be  to  translate 
the  results  of  both  evaluation  efforts  into  a  practical,  integrated 
set  of  tools  that  aid  distributed  database  designers,  and  into  a 
standard  architecture  of  distributed  DBMS  that  facilitates  the 
interconnection  of  different  DBMSs. 


This  is  the  first  volume  of  the  final  technical  report  for  the 
project "Distributed Database Control and Allocation," sponsored by
Rome Air Development Center, contract number F30602-81-C0028. This
volume  describes  frameworks  for  understanding  concurrency  control  and 
recovery  algorithms  for  centralized  and  distributed  database  systems. 

The second volume describes work on the performance analysis of
concurrency control algorithms.

This  volume  is  an  anthology  of  papers  organized  in  two  parts. 

Part  One  covers  concurrency  control  and  consists  of  Sections  II  through 
V.  Part  Two  covers  recovery  and  consists  of  Sections  VI  through  IX. 

Section  II  presents  an  overview  of  concurrency  control  algorithms 
for  distributed  database  systems.  It  decomposes  the  concurrency  control 
problem  into  several  subproblems,  and  describes  the  known  solutions  to 
the  subproblems.  This  paper  extends  Bernstein  and  Goodman's  earlier 
survey of the subject ("Concurrency Control in Distributed Database Systems,"
ACM Computing Surveys 13,2 (June 1981)) by using an improved decomposition
into subproblems and by including more algorithms, notably certifiers and
multiversion algorithms.

In a multiversion concurrency control algorithm, each write operation
on a data item, x, produces a new "version" of x, leaving old versions
of x unchanged. Section III presents a comprehensive description and
logical analysis of multiversion concurrency control algorithms. It extends
serializability theory to handle the multiversion semantics of "write." It
describes  multiversion  concurrency  control  algorithms  based  on  locking  and 
timestamping,  and  proves  them  correct  using  the  extended  theory. 

Sections IV and V present mathematical analyses of two-phase locking
algorithms. Section IV uses a queueing theoretic approach coupled with a
random graph analysis of deadlocks. Section V presents a new control theoretic
analysis that relies on significantly weaker assumptions than the standard
queueing theoretic approach; thus, its conclusions are quite general. These
analyses only study a few of the many available concurrency control
algorithms, and therefore are not comprehensive. They principally demonstrate
the feasibility of these approaches, but leave open a more complete comparative
analysis of algorithms. Such a comparative analysis using simulation
techniques appears in the second volume of this report.

Part Two describes recovery algorithms for database systems. A
non-mathematical survey of these algorithms is presented in Section VI. It
describes methods for: undoing a transaction after it aborts; redoing committed
transactions after a site fails; and extending any centralized undo or redo
algorithm to a distributed system.

A  mathematical  model  of  recovery  algorithms  is  presented  in  Section  VII. 
Each  of  the  algorithms  described  in  Section  VI  is  proved  correct  using  this 
model. 

Section VIII presents a new algorithm for site recovery in a distributed
database system. The problem is how to bring a site's database up to date
after it has recovered from failure. The solution allows different portions
of the database to be brought up to date independently, thereby avoiding a
strongly synchronized bulk transfer of the entire database to the recovered site.

Section IX presents an optimal algorithm for undoing a set of transactions
to a consistent state. The problem is that when one transaction is undone,
any transaction that read its output must also be undone. Thus, undos may
cascade, requiring many other transactions to be undone. It is shown that there
is a unique best set of transactions to undo, and a fast algorithm for finding
this set is described.
ABSTRACT


Dozens of articles have been published describing "new" concurrency
control algorithms for distributed database systems. All of these
algorithms can be derived and understood using a few basic concepts. We show
how to decompose the concurrency control problem into several subproblems,
each of which has just a few known solutions. By appropriately combining
known solutions to the subproblems, we show that all published concurrency
control algorithms and many new ones can be constructed. The glue that
binds the subproblems and solutions together is a mathematical theory known
as serializability theory.

This paper does not assume previous knowledge of distributed database
concurrency control algorithms, and is suitable for both the uninitiated
and the cognoscente.
1. INTRODUCTION


A distributed database system (DDBS) is a database system (DBS) that
provides  commands  to  read  and  write  data  that  is  stored  at  multiple  sites 
of  a  network.  If  users  access  a  DDBS  concurrently,  they  may  interfere  with 
each  other  by  attempting  to  read  and/or  write  the  same  data.  Concurrency 
control  is  the  activity  of  preventing  such  behavior. 

Dozens  of  algorithms  that  solve  the  DDBS  concurrency  control  problem 
have been published (see [BG1] and the references). Unfortunately, many
of  these  algorithms  are  so  complex  that  only  an  expert  can  understand  them. 

To  remedy  this  situation,  we  have  developed  a  simple  framework  for 
understanding  concurrency  control  algorithms.  The  framework  decomposes 
the problem into subproblems and gives basic techniques for solving each
subproblem.  To  understand  a  published  algorithm,  one  first  identifies  the 
technique  used  for  each  subproblem  and  then  checks  that  the  techniques  are 
appropriately  combined.  The  framework  can  also  be  used  to  develop  new 
algorithms  by  combining  existing  techniques  in  new  ways. 

The  paper  has  10  sections.  Sections  2  and  3  set  the  stage  by 
describing  a  simple  DDBS  architecture  and  sketching  the  framework  in  terms 
of  the  architecture.  The  framework  itself  appears  in  Sections  4-8. 

Section  9  uses  the  framework  to  explain  several  published  algorithms. 
Section  10  is  the  conclusion. 

This paper refines an earlier survey of concurrency control algorithms
[BG1]. The earlier paper includes many technical details that are omitted
here. We urge the interested reader to consult [BG1] for more details.


2.  DISTRIBUTED  DBS  ARCHITECTURE 

We use a simple model of DDBS structure and behavior. The model
highlights those aspects of a DDBS that are important for understanding
concurrency control, while hiding details that don't affect concurrency
control.

A database consists of a set of data items, denoted {...,x,y,z}. In
practice, a data item can be a file, record, page, etc. But for the purposes
of this paper, it's best to think of a data item as a simple variable. For
now, assume each data item is stored at exactly one site.

Users access data items by issuing Read and Write operations. Read(x)
returns the current value of x. Write(x, new-value) updates the current
value of x to new-value.

Users interact with the DDBS by executing programs called transactions.
A transaction only interacts with the outside world by issuing Reads and
Writes to the DDBS or by doing terminal I/O. We assume that every
transaction is a complete and correct computation; each transaction, if executed
alone on an initially consistent database, would terminate, produce correct
results, and leave the database consistent.

Each site of a DDBS runs one or more of the following software modules
(see figures 1 and 2): a transaction manager (TM), a data manager (DM), or
a scheduler. Transactions talk to TM's; TM's talk to schedulers; schedulers
talk among themselves and also talk to DM's; and DM's manage the data.

Each  transaction  issues  all  of  its  Reads  and  Writes  to  a  single  TM. 

A  transaction  also  issues  a  Begin  operation  to  its  TM  when  it  starts 
executing  and  an  End  when  it's  finished. 

The TM forwards each Read and Write to a scheduler. (Which scheduler
depends on the concurrency control algorithm; usually, the scheduler is at
the same site as the data being read or written. In some algorithms,
Begins and Ends are also sent to schedulers.)

The scheduler controls the order in which DM's process Reads and
Writes. When a scheduler receives a Read or Write operation, it can
either output the operation right away (usually to a DM, sometimes to
another scheduler), delay the operation by holding it for later action, or
reject the operation. A rejection causes the system to abort the
transaction that issued the operation: every Write processed on behalf of the
transaction is undone (restoring the old value of the data item), and
every transaction that read a value written by the aborted transaction is
also aborted. This phenomenon of one abort triggering other aborts is
called cascading aborts. (It is usually avoided in commercial DBS's by
not allowing a transaction to read another transaction's output until the
DBS is certain that the latter transaction will not abort. In this paper,
we will not try to prevent cascading aborts.) This paper does not discuss
techniques for implementing abort. See [GMBL, HS, LS].

The  DM  executes  each  Read  and  Write  it  receives.  For  Read,  the  DM 
looks  in  its  local  database  and  returns  the  requested  value.  For  Write, 
the  DM  modifies  its  local  database  and  returns  an  acknowledgment.  The  DM 
sends  the  returned  value  or  acknowledgment  to  the  scheduler,  which  relays 
it  back  to  the  TM,  which  relays  it  back  to  the  transaction. 

DM's do not necessarily execute operations first-come-first-served.
If a DM receives a Read(x) and a Write(x) at about the same time, the
DM is free to execute these operations in either order. If the order
matters (as it probably does in this case), it is the scheduler's
responsibility to enforce the order. This is done by using a handshaking
communication discipline between schedulers and DM's (see figure 3): if the
scheduler wants Read(x) to be executed before Write(x), it sends
Read(x) to the DM, waits for the DM's response, and then sends Write(x).
Thus the scheduler doesn't even send Write(x) to the DM until it knows
Read(x) was executed. Of course, when the execution order doesn't matter,
the scheduler can send operations without the handshake.

[Figure 3. Handshaking. To execute Read(x) on behalf of transaction 1
followed by Write(x) on behalf of transaction 2: the scheduler sends
Read(x); the DM receives and executes Read(x) and sends an acknowledgment;
the scheduler then sends Write(x), which the DM receives and executes.]

Handshaking is also used between other modules when execution order
is important.


3. THE FRAMEWORK

The DDBS modules most important to concurrency control are schedulers.
A concurrency control algorithm consists of some number of schedulers, running
some type of scheduling algorithm, in a centralized or distributed fashion.
In addition, the concurrency control algorithm must handle "replicated
data" somehow. (TM's often handle this problem.)

To understand a concurrency control algorithm using our framework, one
determines

(i) the type of scheduling algorithm used (discussed in Sections
5 and 8),

(ii) the location of the scheduler(s), i.e., centralized vs.
distributed (Section 6), and

(iii) how replicated data is handled (Section 7).

The  framework  also  includes  rules  that  tell  when  a  concurrency  control 
algorithm  is  correct.  These  rules  give  precise  conditions  under  which  a 
DDBS  produces  correct  executions.  These  rules,  called  serializability 
theory,  are  discussed  in  the  next  section. 


4. SERIALIZABILITY THEORY


Serializability  theory  is  a  collection  of  mathematical  rules  that  tell 
whether  a  concurrency  control  algorithm  works  correctly  [BSW,  Casa,  EGLT, 
Papa,  PBR,  SLR].  Serializability  theory  does  its  job  by  looking  at  the 
executions  allowed  by  the  concurrency  control  algorithm.  The  theory  gives 
a  precise  condition  under  which  an  execution  is  correct.  A  concurrency 
control  algorithm  is  then  judged  to  be  correct  if  all  of  its  executions 
are  correct. 

4.1  Logs 

Serializability theory models executions by a construct called a log.
A log identifies the Read and Write operations executed on behalf of each
transaction, and tells the order in which those operations were executed.
Following Lamport, we allow an execution order to be a partial order [Lamp].

A transaction log represents an allowable execution of a single
transaction. Formally, a transaction log is a partially ordered set (poset)
$T_i = (\Sigma_i, <_i)$, where $\Sigma_i$ is the set of Reads and Writes issued
by (an execution of) transaction i, and $<_i$ tells the order in which those
operations must be executed. We write transaction logs as diagrams.

           r_1[x] \
    T_1 =           -> w_1[x]
           r_1[z] /

$T_1$ represents a transaction that reads x and z in parallel, and then
writes x. (Presumably, the value written depends on the values read.)

Let $T = \{T_0, \ldots, T_n\}$ be a set of transaction logs. A DDBS log (or
simply a log) over T represents an execution of $T_0, \ldots, T_n$. Formally, a
log over T is a poset $L = (\Sigma, <)$ where

(1) $\Sigma = \bigcup_{i=0}^{n} \Sigma_i$, and

(2) $< \;\supseteq\; \bigcup_{i=0}^{n} <_i$.

Condition (1) states that the DDBS executed all, and only, the operations
submitted by $T_0, \ldots, T_n$. Condition (2) states that the DDBS honored all
operation orderings stipulated by the transactions.

The following are all possible logs over the example transaction log
from above.

           r_1[x] \
    (1)             -> w_1[x]
           r_1[z] /

    (2)    r_1[x] -> r_1[z] -> w_1[x]

    (3)    r_1[z] -> r_1[x] -> w_1[x]

Notice that the DDBS is not required to process Read(x) and Read(z) in
parallel, even though $T_1$ allows this parallelism. However, the DDBS
is not allowed to reverse or eliminate any ordering stipulated by $T_1$.
The following is not a log over $T_1$

    (4)    r_1[z] -> w_1[x] -> r_1[x]

because it reverses the order in which $T_1$ reads and writes x.


There is one further constraint on the form of logs. Two operations
conflict if they operate on the same data item and (at least) one of them
is a Write. To ensure that logs represent unique computations, we require
that all pairs of conflicting operations be ordered. This constraint
applies to transaction logs as well as DDBS logs.

We use $r_i[x]$ (resp., $w_i[x]$) to denote a Read (resp., Write) on x
issued by $T_i$. To keep this notation unambiguous, we assume that no
transaction reads or writes a data item more than once.
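
The conflict test can be stated in a few lines of code. The following Python sketch is illustrative only: the `Op` tuple and the single-letter kinds "r" and "w" are assumptions made for this example, not notation from the paper.

```python
from typing import NamedTuple


class Op(NamedTuple):
    kind: str  # "r" for a Read, "w" for a Write (assumed encoding)
    txn: int   # index i of the issuing transaction T_i
    item: str  # name of the data item, e.g. "x"


def conflict(a: Op, b: Op) -> bool:
    """Two operations conflict if they operate on the same data item
    and at least one of them is a Write."""
    return a.item == b.item and "w" in (a.kind, b.kind)
```

Under this encoding, a Read and a Write on x conflict, two Reads on x do not, and operations on different items never conflict.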


Given transaction logs

    T_0 =  w_0[x]
           w_0[y]
           w_0[z]

           r_1[x] \
    T_1 =           -> w_1[x]
           r_1[z] /

    T_2 =  r_2[x] -> w_2[y]

                     / w_3[y]
    T_3 =  r_3[z] ->
                     \ w_3[z]

the following is a log over $\{T_0, T_1, T_2, T_3\}$.

[Diagram: log $L_1$, a partial order on the operations of $T_0$ through $T_3$
whose orderings include $w_0[x] < r_1[x]$, $r_2[x] < w_1[x]$, and
$w_0[y] < w_2[y] < w_3[y]$.]

(Note that orderings implied by transitivity are usually not drawn. E.g.,
$w_0[y] < w_3[y]$ is not drawn in the diagram, although it follows from
$w_0[y] < w_2[y]$ and $w_2[y] < w_3[y]$.)


Let L be a log over some set T, and suppose $w_i[x]$ and $r_j[x]$ are
operations in L. We say $r_j[x]$ reads-from $w_i[x]$ if $w_i[x] < r_j[x]$ and
no $w_k[x]$ falls between $r_j[x]$ and $w_i[x]$. In this log

    w_0[x] -> r_1[x] -> w_2[x] -> r_3[x] -> r_4[x]

$r_1[x]$ reads-from $w_0[x]$, and $r_3[x]$ and $r_4[x]$ read-from $w_2[x]$. We
call $w_i[x]$ a final-write in L if no $w_k[x]$ follows it. In this log

    w_0[x] -> w_1[x] -> w_2[y] -> r_2[y]

$w_1[x]$ and $w_2[y]$ are final-writes.

Intuitively, two logs over T are equivalent if they represent the same
computation.  Formally,  two  logs  over  T  are  equivalent  if 

(1)  each  Read  reads-from  the  same  Write  in  both  logs,  and 

(2)  they  have  the  same  final-writes. 

Condition  (1)  ensures  that  each  transaction  reads  the  same  values  from  the 
database  in  each  log.  Condition  (2)  ensures  that  the  same  transaction 
writes  the  final  value  of  a  given  data  item  in  both  logs. 
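
The two conditions can be checked mechanically. The sketch below is a minimal illustration that assumes a log is totally ordered and is given as a Python list of `(kind, txn, item)` tuples, with kind "r" or "w"; that representation and the function names are assumptions of this example, and partially ordered logs are not handled.

```python
def reads_from(log):
    """Map each Read to the Write it reads from (None if no earlier Write)."""
    last_write = {}  # item -> most recent Write on that item so far
    result = {}
    for op in log:
        kind, _txn, item = op
        if kind == "w":
            last_write[item] = op
        else:
            result[op] = last_write.get(item)
    return result


def final_writes(log):
    """Return the set of Writes that no later Write on the same item follows."""
    last_write = {}
    for op in log:
        kind, _txn, item = op
        if kind == "w":
            last_write[item] = op
    return set(last_write.values())


def equivalent(log_a, log_b):
    """Same operations, same reads-from relation, same final-writes."""
    return (sorted(log_a) == sorted(log_b)
            and reads_from(log_a) == reads_from(log_b)
            and final_writes(log_a) == final_writes(log_b))
```

For instance, two logs that interleave the operations on x and on y differently, but leave each Read after the same Write, are reported as equivalent.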

The following log $L_2$ is equivalent to log $L_1$ of Section 4.1.

$L_2 = w_0[x]\, w_0[y]\, w_0[z]\, r_2[x]\, w_2[y]\, r_1[x]\, r_1[z]\, w_1[x]\, r_3[z]\, w_3[y]\, w_3[z]$.

(When we write a log as a sequence, e.g. $L_2$, we mean that the log is totally
ordered: each operation precedes the next one and all subsequent ones in
the sequence. Thus, in $L_2$, $w_0[x] < w_0[y] < w_0[z] < r_2[x] < \ldots$ .)


4.3  Serializable  Logs 


A serial log is a total order on $\Sigma$ such that for every pair of
transactions $T_i$ and $T_j$, either all of $T_i$'s operations precede all of
$T_j$'s, or vice versa (e.g., $L_2$ in Section 4.2). A serial log represents
an  execution  in  which  there  is  no  concurrency  whatsoever;  each  transaction 
executes  from  beginning  to  end  before  the  next  transaction  begins.  From 
the  point  of  view  of  concurrency  control,  therefore,  every  serial  log 
represents  an  obviously  correct  execution. 

What  other  logs  represent  correct  executions?  From  the  point  of  view 
of  concurrency  control,  a  correct  execution  is  one  in  which  concurrency 
is  invisible.  That  is,  an  execution  is  correct  if  it  is  equivalent  to  an 
execution  in  which  there  is  no  concurrency.  Serial  logs  represent  the 
latter executions, and so a correct log is any log equivalent to a serial
log. Such logs are termed serializable (SR). Log $L_1$ of Sec. 4.1 is SR,
because it is equivalent to serial log $L_2$ of Sec. 4.2. Therefore $L_1$ is
a correct log.

Serializability  theory  is  the  study  of  serializable  logs. 

4.4  The  Serializability  Theorem 

This  section  presents  the  main  theorem  of  serializability  theory.  Later 
sections  rely  on  this  theorem  to  analyze  concurrency  control  algorithms. 

This  theorem  uses  a  graph  derived  from  a  log,  called  a  serialization  graph. 

Suppose L is a log over $\{T_0, \ldots, T_n\}$. The serialization graph for
L, SG(L), is a directed graph whose nodes are $T_0, \ldots, T_n$, and whose edges
are all $T_i \to T_j$ such that, for some x, either (i) $r_i[x] < w_j[x]$, or
(ii) $w_i[x] < r_j[x]$, or (iii) $w_i[x] < w_j[x]$. The serialization graph for
example log $L_1$ is

[Figure: SG($L_1$)]

Edge $T_0 \to T_1$ is present because $w_0[x] < r_1[x]$. Edge $T_2 \to T_1$ is caused
by $r_2[x] < w_1[x]$. Edge $T_2 \to T_3$ arises from $w_2[y] < w_3[y]$. And so forth.

SERIALIZABILITY THEOREM. If SG(L) is acyclic then L is SR. □
For example, since SG($L_1$) is acyclic, $L_1$ is SR.


We can also use the Serializability Theorem to determine if a scheduler
produces SR logs. First, we characterize the logs produced by the scheduler.
Then we prove that every such log has an acyclic SG [BSW, Papa].
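
As an illustration of how the theorem is applied to a single log, the following Python sketch builds a serialization graph and tests it for cycles. It assumes a totally ordered log given as a list of `(kind, txn, item)` tuples with kind "r" or "w", and it takes edges only between distinct transactions; the representation, the function names, and the use of Kahn's topological-sort algorithm for the cycle test are choices made for this example.

```python
from itertools import combinations


def serialization_graph(log):
    """Return edges (i, j) meaning T_i -> T_j: an operation of T_i precedes
    a conflicting operation of T_j in the log."""
    edges = set()
    # combinations() preserves list order, so the first op is the earlier one.
    for (k1, t1, x1), (k2, t2, x2) in combinations(log, 2):
        if t1 != t2 and x1 == x2 and "w" in (k1, k2):
            edges.add((t1, t2))
    return edges


def is_acyclic(nodes, edges):
    """Kahn's algorithm: the graph is acyclic iff every node can be removed
    in some order in which it has no remaining incoming edge."""
    indegree = {n: 0 for n in nodes}
    for _source, target in edges:
        indegree[target] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for source, target in edges:
            if source == n:
                indegree[target] -= 1
                if indegree[target] == 0:
                    ready.append(target)
    return removed == len(nodes)
```

A log such as r_1[x] r_2[x] w_1[x] w_2[x] yields edges in both directions between T_1 and T_2, so the test fails and the theorem does not certify the log as SR.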

Some concurrency control algorithms schedule read-write conflicts
separately from write-write conflicts. It is easier to analyze such
algorithms using a restatement of the Serializability Theorem. Define the
read-write serialization graph for L, $SG_{rw}(L)$, as follows: $SG_{rw}(L)$ has
nodes $T_0, \ldots, T_n$ and edges $T_i \to T_j$ such that, for some x, either
(i) $r_i[x] < w_j[x]$, or (ii) $w_i[x] < r_j[x]$. In other words, $SG_{rw}(L)$ is like
SG(L) except we don't care about write-write conflicts. The write-write
serialization graph for L, $SG_{ww}(L)$, is defined analogously: the nodes are
$T_0, \ldots, T_n$, and the edges are $T_i \to T_j$ such that, for some x,
$w_i[x] < w_j[x]$.

[Figures: $SG_{rw}(L_1)$ and $SG_{ww}(L_1)$]

Of course, $SG(L) = SG_{rw}(L) \cup SG_{ww}(L)$.
RESTATED SERIALIZABILITY THEOREM [BG1]. If the following four
conditions hold, then L is SR.

(i) $SG_{rw}(L)$ is acyclic.

(ii) $SG_{ww}(L)$ is acyclic.

(iii) For all $T_i$ and $T_j$, if $T_i$ precedes $T_j$ in $SG_{rw}(L)$ then
either $T_i$ precedes $T_j$ in $SG_{ww}(L)$, or there is no path
between them in $SG_{ww}(L)$.

(iv) For all $T_i$ and $T_j$, if $T_i$ precedes $T_j$ in $SG_{ww}(L)$ then
either $T_i$ precedes $T_j$ in $SG_{rw}(L)$, or there is no path
between them in $SG_{rw}(L)$.

Conditions (i)-(iv) are just another way of saying that SG(L) is
acyclic. The conditions allow us to analyze the correctness of read-write
(rw) scheduling almost independently of write-write (ww) scheduling.
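
To make the split concrete, the Python sketch below separates the edges of a serialization graph into its read-write and write-write parts. It assumes a totally ordered log given as a list of `(kind, txn, item)` tuples with kind "r" or "w"; the representation and the function name are assumptions of this example, and the path conditions (iii) and (iv) are not checked here.

```python
from itertools import combinations


def split_serialization_graphs(log):
    """Return (rw_edges, ww_edges), each a set of pairs (i, j) for T_i -> T_j."""
    rw_edges, ww_edges = set(), set()
    # combinations() preserves list order, so the first op is the earlier one.
    for (k1, t1, x1), (k2, t2, x2) in combinations(log, 2):
        if t1 == t2 or x1 != x2:
            continue  # same transaction or different data items
        if k1 == "w" and k2 == "w":
            ww_edges.add((t1, t2))  # write-write conflict
        elif k1 != k2:
            rw_edges.add((t1, t2))  # one Read and one Write
    return rw_edges, ww_edges
```

The union of the two returned sets gives the edges of the full serialization graph for the same log.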


5.  SCHEDULERS 

There are four types of schedulers for producing SR executions: two-phase
locking, timestamp ordering, serialization graph checking, and
certifiers. Each type of scheduler can be used to schedule rw conflicts,
ww conflicts, or both. This section describes each type of scheduler
assuming it is used for both kinds of conflict. Ways of combining scheduler
types (e.g. two-phase locking for rw conflicts and timestamp ordering
for ww conflicts) are described in Section 9. This section also assumes
that the scheduler runs at a single site (see figure 4); Section 6 lifts
this restriction.

5.1  Two-Phase  Locking 

A two-phase locking (2PL) scheduler is defined by three rules [EGLT]:

i. Before outputting $r_i[x]$ (resp. $w_i[x]$), set a read-lock (resp.
write-lock) for $T_i$ on x. The lock must be held (at least)
until the operation is executed by the appropriate DM. (Handshaking
can be used to guarantee that locks are held long enough.)

ii. Different transactions cannot simultaneously hold "conflicting"
locks. Two locks conflict if they are on the same data item and
(at least) one is a write-lock. If rw and ww scheduling are
done separately, the definition of "conflict" is modified. For
rw scheduling, two locks on the same data item conflict if
exactly one is a write-lock; i.e., write-locks don't conflict
with each other. For ww scheduling, both locks must be write-locks.


iii.  After  releasing  a  lock,  a  transaction  cannot  obtain  any  more  locks. 


Rule  (iii)  causes  locks  to  be  obtained  in  a  two-phase  manner.  During 
its  growing  phase,  a  transaction  obtains  locks  without  releasing  any.  By 
releasing  a  lock,  the  transaction  enters  its  shrinking  phase  during  which 
it  can  only  release  locks.  Rule  (iii)  is  usually  implemented  by  holding 
all  of  a  transaction's  locks  until  it  terminates. 
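
The three rules can be sketched as a small lock table. The Python class below is a minimal illustration, not an implementation from the paper: the class and method names are hypothetical, rw and ww conflicts are scheduled together, a refused request is simply reported to the caller (who is assumed to delay the operation), and rule (iii) is enforced by releasing all of a transaction's locks at once.

```python
class TwoPhaseLockTable:
    def __init__(self):
        self.readers = {}       # item -> set of transactions holding read-locks
        self.writer = {}        # item -> transaction holding the write-lock
        self.shrinking = set()  # transactions that have released their locks

    def acquire(self, txn, item, mode):
        """Try to set a lock (mode "r" or "w"); return True if it was granted."""
        if txn in self.shrinking:
            # Rule (iii): no new locks after a release.
            raise RuntimeError("transaction is in its shrinking phase")
        other_readers = self.readers.get(item, set()) - {txn}
        holder = self.writer.get(item)
        other_writer = holder is not None and holder != txn
        # Rule (ii): a write-lock conflicts with any lock of another transaction.
        if other_writer or (mode == "w" and other_readers):
            return False
        if mode == "w":
            self.writer[item] = txn
        else:
            self.readers.setdefault(item, set()).add(txn)
        return True

    def release_all(self, txn):
        """Release every lock held by txn, e.g. when it terminates."""
        self.shrinking.add(txn)
        for holders in self.readers.values():
            holders.discard(txn)
        for item in [i for i, t in self.writer.items() if t == txn]:
            del self.writer[item]
```

If T_1 read-locks x and T_2 read-locks y, then neither the write-lock request of T_1 on y nor that of T_2 on x is granted until the other transaction releases its locks.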

2PL  THEOREM.  A  2PL  scheduler  only  produces  SR  logs. 

Proof Sketch. Consider a log L produced by a 2PL scheduler. If
$T_i \to T_j$ is in SG(L), then $T_i$ released some lock before $T_j$ obtained
that lock. If there's a nonempty path in SG(L) from $T_i$ to $T_i$ (i.e.,
a cycle) then, by transitivity, $T_i$ released a lock before $T_i$ obtained
some lock, thereby breaking rule (iii). So, SG(L) is acyclic. By the
Serializability Theorem, this implies that L is SR. □

Due to rule (ii), an operation received by a scheduler may be delayed
because another transaction already owns a conflicting lock. Such blocking
situations can lead to deadlock. For example, suppose $r_1[x]$ and $r_2[y]$
set read-locks, and then the scheduler receives $w_1[y]$ and $w_2[x]$. The
scheduler cannot set the write-lock needed by $w_1[y]$ because $T_2$ holds a
read-lock on y. Nor can it set the write-lock for $w_2[x]$ because $T_1$
holds a read-lock on x. And, neither $T_1$ nor $T_2$ can release its
read-lock before getting the needed write-lock because of rule (iii). Hence,
we have a deadlock: $T_1$ is waiting for $T_2$ which is waiting for $T_1$.

Deadlocks can be characterized by a waits-for graph [Holt, KC], a
directed graph whose nodes represent transactions and whose edges represent
waiting relationships: edge $T_i \to T_j$ means $T_i$ is waiting for a lock
owned by $T_j$. A deadlock exists if and only if (iff) the waits-for graph
has a cycle. E.g., in the above example the waits-for graph is

    T_1 -> T_2 -> T_1


A popular way of handling deadlock is to maintain the waits-for graph
and periodically search it for cycles. (See [Chap. 5, AHU] for cycle
detection algorithms.) When a deadlock is detected, one of the
transactions on the cycle is aborted and restarted, thereby breaking the
deadlock.
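
One way to search a waits-for graph for a cycle is a depth-first search. The Python sketch below is illustrative only; it assumes the graph is given as a dictionary mapping each transaction to the set of transactions it is waiting for, and both the representation and the function name are assumptions of this example rather than the method of [AHU].

```python
def find_deadlock(waits_for):
    """Return a list of transactions forming a cycle, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(txn):
        color[txn] = GREY
        path.append(txn)
        for other in waits_for.get(txn, ()):
            state = color.get(other, WHITE)
            if state == GREY:
                # 'other' is on the current path, so the path loops back to it.
                return path[path.index(other):]
            if state == WHITE:
                cycle = visit(other)
                if cycle:
                    return cycle
        path.pop()
        color[txn] = BLACK
        return None

    for txn in list(waits_for):
        if color.get(txn, WHITE) == WHITE:
            cycle = visit(txn)
            if cycle:
                return cycle
    return None
```

A scheduler could call such a function periodically and abort one transaction from the returned list; which transaction to choose as the victim is left open here.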


5.2  Timestamp  Ordering 

In timestamp ordering (T/O) each transaction is assigned a globally
unique timestamp by its TM. (See [BG1, Thom] for how this is done.) The
TM attaches the timestamp to all operations issued by the transaction. A
T/O scheduler is defined by a single rule: Output all pairs of conflicting
operations in timestamp order. Make sure conflicting operations are
executed by DM's in the order they were output. (Handshaking can be used
to make sure of this.) As for 2PL, the definition of "conflicting operation"
is modified if rw and ww scheduling are done separately.

T/O THEOREM. A T/O scheduler only produces SR logs.

Proof Sketch. Since every pair of conflicting operations is in
timestamp order, each edge $T_i \to T_j$ in SG has $TS(T_i) < TS(T_j)$
(where $TS(T_i)$ is the timestamp of $T_i$). Thus, SG cannot have any cycles.
So, by the Serializability Theorem, the log produced by the scheduler is
SR. □

Several varieties of T/O schedulers have been proposed. We only sketch
these variations here. Full details appear in [BG1].

A basic T/O scheduler outputs operations in essentially
first-come-first-served order, as long as the T/O scheduling rule holds. When the
scheduler receives $r_i[x]$ it does the following.

    if TS(i) < largest timestamp of any Write on x yet "accepted"
    then reject $r_i[x]$
    else "accept" $r_i[x]$ and output it as soon as all Writes on x with
         smaller timestamp have been acknowledged by the DM.

When the scheduler receives $w_i[x]$ it behaves as follows.

    if TS(i) < largest timestamp of any Read or Write on x yet "accepted"
    then reject $w_i[x]$
    else "accept" $w_i[x]$ and output it as soon as all Reads and Writes on
         x with smaller timestamp have been acknowledged by the DM.
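
The accept/reject decision of basic T/O can be sketched as follows. This Python class is a simplified illustration: the class and method names are hypothetical, and it models only the timestamp test, omitting the wait for DM acknowledgments that governs when an accepted operation is actually output.

```python
NEG_INF = float("-inf")


class BasicTO:
    def __init__(self):
        self.max_read = {}   # item -> largest timestamp of any accepted Read
        self.max_write = {}  # item -> largest timestamp of any accepted Write

    def read(self, ts, item):
        """Decide the fate of a Read with timestamp ts on item."""
        if ts < self.max_write.get(item, NEG_INF):
            return "reject"
        self.max_read[item] = max(ts, self.max_read.get(item, NEG_INF))
        return "accept"

    def write(self, ts, item):
        """Decide the fate of a Write with timestamp ts on item."""
        newest = max(self.max_read.get(item, NEG_INF),
                     self.max_write.get(item, NEG_INF))
        if ts < newest:
            return "reject"
        self.max_write[item] = ts
        return "accept"
```

For example, after a Write on x with timestamp 5 is accepted, a Read on x with timestamp 3 is rejected, while a Read with timestamp 7 is accepted.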

A conservative T/O scheduler avoids rejections by delaying operations
instead. An operation is delayed until the scheduler is sure that outputting
it will cause no future operations to be rejected. Conservative T/O requires
that each scheduler receive Reads and Writes from each TM in timestamp order.
To output any operation, the scheduler must have an operation from each TM
in its "input queue." The scheduler then "accepts" the operation with
smallest timestamp. "Accept" means remove the operation from the input
queue, and output it as soon as all conflicting operations with smaller
timestamp have been acknowledged by the DM. Variations on conservative
T/O are discussed in [BG1, BSR].
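
The input-queue discipline can be sketched as below. This Python class is an illustration under several assumptions: there is at least one TM, the set of TMs is fixed and known in advance, each TM delivers its operations in timestamp order, timestamps are unique, and the wait for DM acknowledgments is omitted. The class and method names are hypothetical.

```python
from collections import deque


class ConservativeTO:
    def __init__(self, tms):
        # One input queue per TM; each is assumed to arrive in timestamp order.
        self.queues = {tm: deque() for tm in tms}

    def receive(self, tm, ts, op):
        """Queue an operation and return the operations that can now be accepted."""
        self.queues[tm].append((ts, op))
        accepted = []
        # An operation may be accepted only while every TM has one queued.
        while all(self.queues.values()):
            smallest = min(self.queues, key=lambda t: self.queues[t][0][0])
            accepted.append(self.queues[smallest].popleft())
        return accepted
```

Because nothing is accepted until every queue is nonempty, the scheduler never has to reject an operation, at the price of delaying operations from busy TMs while it waits for idle ones.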

Basic T/O and conservative T/O are endpoints of a spectrum. Basic
T/O delays operations very little, but tends to reject many operations.
Conservative T/O never rejects operations, but tends to delay them a lot.
One can imagine T/O schedulers between these extremes. To our knowledge,
no one has yet proposed such a scheduler.


Thomas' write rule (TWR) is a technique that reduces delay and
rejection [Thom]. TWR can only be used to schedule Writes, and needs to
be combined with basic or conservative T/O to yield a complete scheduler.

If we're only interested in ww scheduling, TWR is simple. When the scheduler
receives $w_i[x]$ it does the following.

    if TS(i) < largest timestamp of any Write on x yet "accepted"
    then "pretend" to execute $w_i[x]$, i.e., send an acknowledgment
         back to the TM, but don't send the Write to the DM
    else "accept" $w_i[x]$ and process it as usual.


The basic T/O-TWR combination works like this. Reads are processed
exactly as in basic T/O. But when the scheduler receives a $w_i[x]$, it
combines the basic T/O rule and TWR as follows.

    if TS(i) < largest timestamp of any Read           (rw scheduling,
       on x yet "accepted"                              basic T/O)
    then reject $w_i[x]$
    else if TS(i) < largest timestamp of any Write     (ww scheduling,
            on x yet "accepted"                         TWR)
    then "pretend" to execute $w_i[x]$
    else "accept" $w_i[x]$ and output it as soon as all operations
         on x with smaller timestamp have been acknowledged by the DM.

The conservative T/O-TWR combination is described in [BG1].
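
The basic T/O-TWR write rule can be sketched in Python as follows. The class and method names are hypothetical, the return value "pretend" stands for acknowledging the Write to the TM without sending it to the DM, and the wait for DM acknowledgments before output is omitted.

```python
NEG_INF = float("-inf")


class BasicTOWithTWR:
    def __init__(self):
        self.max_read = {}   # item -> largest timestamp of any accepted Read
        self.max_write = {}  # item -> largest timestamp of any accepted Write

    def read(self, ts, item):
        """Reads are handled exactly as in basic T/O."""
        if ts < self.max_write.get(item, NEG_INF):
            return "reject"
        self.max_read[item] = max(ts, self.max_read.get(item, NEG_INF))
        return "accept"

    def write(self, ts, item):
        # rw scheduling (basic T/O): a Write that arrives too late for an
        # accepted Read must be rejected.
        if ts < self.max_read.get(item, NEG_INF):
            return "reject"
        # ww scheduling (TWR): a Write older than an accepted Write is
        # acknowledged but never sent to the DM.
        if ts < self.max_write.get(item, NEG_INF):
            return "pretend"
        self.max_write[item] = ts
        return "accept"
```

Compared with basic T/O alone, a late Write that conflicts only with newer Writes is absorbed instead of causing its transaction to abort.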


5.3  Serialization  Graph  Checking 

This  type  of  scheduler  works  by  explicitly  building  a  serialization 
graph,  SG,  and  checking  it  for  cycles.  Like  basic  T/O,  an  SG  checking 
scheduler  never  delays  an  operation  (except  for  handshaking  reasons) . 
Rejection  is  the  only  action  used  to  avoid  incorrect  logs. 


An SG checking scheduler is defined by the following rules.

i. When transaction $T_i$ Begins, add node $T_i$ to SG.

ii. When a Read or Write from $T_i$ is received, add all edges $T_j \to T_i$
such that $T_j$ is a node of SG, and the scheduler has already output a
conflicting operation from $T_j$. As for the previous schedulers, the
definition of "conflicting operation" is modified if rw and ww conflicts
are scheduled separately.

iii. If after step (ii) SG is still acyclic, output the operation.
Make sure that conflicting operations are executed by DM's in the order they
were output. (Handshaking can be used for this.)

iv. If after (ii) SG has become cyclic, reject the operation. Delete
node $T_i$ and all edges $T_i \to T_j$ or $T_j \to T_i$ from SG. (SG is now acyclic
again.)

SG CHECKER THEOREM. An SG checking scheduler only produces SR logs.

Proof Sketch. Every log produced by the scheduler has an acyclic SG.
So, by the Serializability Theorem, every log is SR. □

One  technical  problem  with  SG  checkers  is  that  a  transaction  must 
remain  in  SG  even  after  it  has  terminated.  A  transaction  can  only  be 
deleted  from  SG  when  it  is  a  source  node  of  the  graph,  i.e.,  when  it  has 
no  incoming  edges.  See  [Casa]  for  a  discussion  of  this  problem  and 
techniques for efficiently encoding information about terminated
transactions that remain in SG.


5.4  Certifiers 


The term "certifier" refers to a scheduling philosophy, not a specific
scheduling rule. A certifier is a scheduler that makes its decisions on a
per-transaction basis. When a certifier receives an operation, it internally
stores information about the operation and outputs it as soon as all
earlier conflicting operations have been acknowledged. When a transaction
ends, its TM sends the End operation to the certifier. At this point, the
certifier checks its stored information to see if the transaction executed
serializably. If it did, the certifier certifies the transaction, allowing
it to terminate; otherwise, the certifier aborts the transaction.

All  of  the  earlier  schedulers  can  be  adapted  to  work  as  certifiers. 

Here is an SG checking certifier. When the certifier receives an operation,
it adds a node and some edges to SG as explained in the previous section.
The certifier does not check for cycles at this time. When a transaction,
$T_i$, ends, the certifier checks SG for cycles. If $T_i$ does not lie on a
cycle, it is certified; otherwise it is aborted.

SG CERTIFIER THEOREM. An SG checking certifier only produces SR logs.

Proof sketch. Consider any "completed" log produced by the certifier.
Completed means that all uncertified transactions are aborted and removed
from the log. (As always, any transaction that read data written by an
aborted transaction is also aborted; this may include some certified
transactions.) The completed log has an acyclic serialization graph. So
by the Serializability Theorem, the log is SR. □

Here is a 2PL certifier [Thom, KR]. Define a transaction to be active
from the time the certifier receives its first operation until the certifier
processes its End. The certifier stores two sets for each active
transaction $T_i$:

$T_i$'s readset, $RS(i) = \{x \mid \text{the certifier has output } r_i[x]\}$

$T_i$'s writeset, $WS(i) = \{x \mid \text{the certifier has output } w_i[x]\}$.

The certifier updates these sets as it receives operations. When the
certifier receives $\mathrm{End}_i$, it runs the following test.

Let $RS(\text{active}) = \bigcup \{RS(j) \mid T_j \text{ is active, but } j \neq i\}$
    $WS(\text{active}) = \bigcup \{WS(j) \mid T_j \text{ is active, but } j \neq i\}$
if $RS(i) \cap WS(\text{active}) \neq \emptyset$, or
   $WS(i) \cap (RS(\text{active}) \cup WS(\text{active})) \neq \emptyset$
then abort $T_i$
else certify $T_i$.

This amounts to pretending that transactions hold imaginary locks on
their readsets and writesets. When transaction $T_i$ ends, the certifier
sees if $T_i$'s imaginary locks conflict with the imaginary locks held by
other active transactions. If there is no conflict, $T_i$ is certified; else
$T_i$ is aborted.
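
The imaginary-lock test can be sketched as below. The class TwoPLCertifier and its method names are hypothetical, and the sketch follows the reading in which a transaction is certified exactly when its readset and writeset do not conflict with those of the other active transactions.

```python
class TwoPLCertifier:
    def __init__(self):
        self.readset = {}   # active txn -> items it has read
        self.writeset = {}  # active txn -> items it has written

    def operation(self, txn, kind, item):
        """Record an output Read ('r') or Write ('w') of an active txn."""
        self.readset.setdefault(txn, set())
        self.writeset.setdefault(txn, set())
        (self.readset if kind == "r" else self.writeset)[txn].add(item)

    def end(self, txn):
        """Process End for txn: True if certified, False if aborted."""
        rs_i = self.readset.pop(txn, set())
        ws_i = self.writeset.pop(txn, set())
        # After the pops, only the *other* active transactions remain.
        rs_active = set().union(*self.readset.values())
        ws_active = set().union(*self.writeset.values())
        conflict = (rs_i & ws_active) or (ws_i & (rs_active | ws_active))
        return not conflict
```

Either way, the transaction stops being active once its End is processed, so its sets no longer block later transactions.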

2PL CERTIFIER THEOREM. A 2PL certifier only produces SR logs.

Proof sketch. Consider a completed log L produced by the certifier.
If $T_i \to T_j$ is in SG(L), then since both $T_i$ and $T_j$ were certified, the
certifier processed $\mathrm{End}_i$ before $\mathrm{End}_j$. If there's a nonempty path in
SG(L) from $T_i$ to $T_i$ (i.e., a cycle) then, by transitivity, the
certifier processed $\mathrm{End}_i$ before $\mathrm{End}_i$. This is absurd. So, SG(L) is
acyclic, and by the Serializability Theorem, L is SR. □


T/O certifiers are also possible. To our knowledge, no one has
proposed this algorithm yet.

Certifiers can also be built that check for serializable executions
during transactions' executions, not just at the end. The extreme version
of this idea is to check for serializability on every operation. At this
extreme, the certifier reduces to a "normal" scheduler.

6. SCHEDULER LOCATION

The schedulers of Section 5 can be modified to work in a distributed
manner. Instead of one scheduler for the whole system, we now assume one
scheduler per DM (refer back to figure 1). The scheduler normally runs at
the same site as the DM, and schedules all operations that the DM executes.

The new issue in this setting is that the distributed schedulers must
cooperate to attain the scheduling rules of Section 5.

The  main  problem  caused  by  distributing  schedulers  is  the  maintenance 
of  global  data  structures.  Distributed  2PL  schedulers  need  a  global 
waits-for  graph.  Distributed  SG  checkers  need  a  global  SG.  In  distributed 
T/O scheduling, no global data structures are needed; each scheduler can
make  its  scheduling  decisions  using  local  copies  of  R-TS(x)  and  W-TS(x) 
for  each  x  at  its  DM.  Distributed  certifiers  generally  manifest  the 
same  problems  as  their  corresponding  schedulers. 


6.1 Distributed Two Phase Locking


Refer to the 2PL scheduling rules of Section 5.1. Rules (i) and (ii)
are "local." The scheduler for data item x schedules all operations on
x. Hence this scheduler can set all locks on x. Rule (iii) requires a
small amount of inter-scheduler cooperation: no scheduler can obtain a
lock for transaction $T_i$ after any scheduler releases a lock for $T_i$.
This can be done by handshaking between TMs and schedulers. When $T_i$
Ends, its TM waits until all of $T_i$'s Reads and Writes are acknowledged.
At this point the TM knows that all of $T_i$'s locks are set, and it's safe
to release locks. The TM forwards $\mathrm{End}_i$ to the schedulers, which then
release $T_i$'s locks.


One problem with distributed 2PL is that multi-site deadlocks are
possible. Suppose x and y are stored at sites A and B, respectively.
Suppose $r_i[x]$ is processed at A, setting a read-lock on x for $T_i$ at
A; and $r_j[y]$ is processed at B, setting a read-lock on y for $T_j$ at
B. If $w_j[x]$ and $w_i[y]$ are now issued, a deadlock will result; $T_j$ will
be waiting for $T_i$ to release its lock on x at A and $T_i$ will be
waiting for $T_j$ to release its lock on y at B. Unfortunately, the
deadlock isn't apparent by looking at site A or B alone. Only by
taking the union of the waits-for graphs at both sites does the deadlock
cycle materialize.
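
The two-site example can be checked mechanically. The sketch below is illustrative only: has_cycle is a hypothetical helper, and each site's waits-for graph is represented as a set of (waiter, holder) pairs.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (waiter, holder) pairs has a cycle."""
    graph = {}
    for waiter, holder in edges:
        graph.setdefault(waiter, set()).add(holder)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True            # back edge: a cycle
        visiting.add(node)
        if any(visit(nxt) for nxt in graph.get(node, ())):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(node) for node in list(graph))


site_a = {("Tj", "Ti")}  # at A, Tj waits for Ti's lock on x
site_b = {("Ti", "Tj")}  # at B, Ti waits for Tj's lock on y
```

Neither local graph has a cycle, but their union does, which is why some form of global waits-for information is needed.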

See  [MM,  Ston,  GISh,  RSL]  for  solutions  to  this  problem. 


6.2 Distributed Timestamp Ordering


T/O schedulers are easy to distribute, because the T/O scheduling
rule of Section 5.2 is inherently local. Consider a basic T/O scheduler
for  data  item  x.  To  process  an  operation  on  x,  the  scheduler  only  needs 
to  know  if  a  conflicting  operation  with  larger  timestamp  has  been  accepted. 
Since  the  scheduler  handles  all  operations  on  x,  it  can  make  this  decision 
itself. 


6.3 Distributed Serialization Graph Checking


SG checkers are harder to distribute than the other schedulers because
the  serialization  graph,  SG,  is  inherently  global.  A  transaction  that 
accesses  data  at  a  single  site  can  become  involved  in  a  cycle  spanning 
many  sites.  See  [Casa]  for  a  discussion  of  this  problem. 

6.4 Distributed Certifiers

Distributed certifiers have a synchronization requirement a little like
rule (iii) of 2PL: $T_i$'s TM must not send $\mathrm{End}_i$ to any certifier until
all of $T_i$'s Reads and Writes have been acknowledged. I.e., we must not
try to certify $T_i$ at any site until we are ready to certify $T_i$ at all
sites.

Beyond this, each distributed certifier behaves like the corresponding
scheduler. A distributed 2PL certifier needs little inter-scheduler
cooperation (beyond the previous paragraph). The certifier at each site
keeps track of the data items at its site read or written by active
transactions. When the certifier at site A receives $\mathrm{End}_i$, it sees if any active
transaction conflicts with $T_i$ at site A. If not, $T_i$ is certified at
site A. If $T_i$ is certified at all sites at which it accessed data,
then it is "really" certified; else $T_i$ is aborted.

A  distributed  SG  certifier  shares  the  problems  of  distributed  SG 
schedulers:  the  certifier  needs  to  check  for  cycles  in  a  global  graph 
every  time  a  transaction  ends. 

6.5  Other  Architectures 

Centralized and distributed scheduling are endpoints of a spectrum. One
can  imagine  hybrid  architectures  with  multiple  DM's  per  scheduler.  See 
figure  5.  This  architecture  adds  no  technical  issues  beyond  those  already 
discussed. 

Hierarchical  scheduler  architectures  are  also  possible.  See  figure  6. 
To  our  knowledge,  no  one  has  studied  this  approach  yet. 

7. DATA REPLICATION

In a replicated database, each logical data item, x, can have many
physical copies, denoted $\{x_1, \ldots, x_m\}$, which are resident at different
DM's. Transactions issue Reads and Writes on logical data items. TM's
translate those operations into Reads and Writes on physical data. The
effect, as seen by each transaction, must be as if there were only one
copy of each data item.

There is a simple way to obtain this effect. Each TM translates
$r_i[x]$ into $r_i[x_j]$ for some copy $x_j$ of x and $w_i[x]$ into $\{w_i[x_j] \mid$ all
copies $x_j$ of x$\}$. If the scheduler(s) is SR, the effect is just like a
nonreplicated database. To see this, consider a serial log equivalent to
the SR log that executed. Since each transaction writes into all copies
of each logical data item, each $r_i[x_j]$ reads from the 'latest'
transaction preceding it that wrote into any copy of x. But this is exactly
what would have happened had there been only one copy of x. (For a more
rigorous explanation, see [ABG].) Consider this example.


$x_1$ and $x_2$ are copies of logical data item x; $y_1$ and $y_2$ are copies
of y. $T_0$ produces initial values for both copies of each data item. $T_1$
reads x and y, and writes x; $T_2$ reads x and y, and writes y.

The log for this example is SR. It is equivalent to the following serial log:


$L_4 = w_0[x_1]\, w_0[x_2]\, w_0[y_1]\, w_0[y_2]\, r_1[x_1]\, r_1[y_1]\, w_1[x_1]\, w_1[x_2]\, r_2[x_2]\, r_2[y_2]\, w_2[y_1]\, w_2[y_2]$

Note that each Read, e.g. $r_2[x_2]$ or $r_2[y_2]$, reads from the 'latest'
transaction preceding it that wrote into any copy of the data item. Therefore,
$L_4$ has the same effect as the following log in which there is no replicated
data:


$w_0[x]\, w_0[y]\, r_1[x]\, r_1[y]\, w_1[x]\, r_2[x]\, r_2[y]\, w_2[y]$.


We call this the do nothing approach to replication: just write into all
copies of each data item and use an SR scheduler.
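
The TM's translation step in the do nothing approach can be sketched as follows. The function translate and its parameters are hypothetical names; in particular, read_copy (which copy a Read is sent to) is an illustrative choice that the text leaves open.

```python
def translate(op, txn, item, copies, read_copy=0):
    """Translate one logical operation into physical operations.

    op is 'r' or 'w'; copies maps a logical item to its list of copy names.
    A Read goes to one copy (chosen by read_copy); a Write goes to all copies.
    """
    names = copies[item]
    if op == "r":
        return [("r", txn, names[read_copy])]
    return [("w", txn, name) for name in names]


copies = {"x": ["x1", "x2"], "y": ["y1", "y2"]}
```

With these copies, a logical Write of x by transaction 1 becomes two physical Writes, one per copy, while a Read touches a single copy.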

Two other approaches to replication have been suggested. In the primary
copy approach, one copy of each x, say $x_p$, is designated its primary copy
[Ston]. Each TM translates $r_i[x]$ into $r_i[x_j]$ for some copy $x_j$, as
before. Writes are translated differently, though. The TM translates
$w_i[x]$ into a single Write, $w_i[x_p]$, on the primary copy. When the primary
copy's scheduler outputs $w_i[x_p]$, it also issues Writes on the other copies
of x (i.e., $w_i[x_1], \ldots, w_i[x_m]$). See Figure 7. These Writes are processed
by the schedulers for $x_1, \ldots, x_m$ in the usual way. For example, in 2PL,
the scheduler for $x_j$ must get a write-lock on $x_j$ for $T_i$ before
outputting $w_i[x_j]$. The primary copy's scheduler may be centralized (in which
case the technique is called primary site [AD]), or distributed with the
primary copy's DM.


Primary copy is a good idea for 2PL schedulers. It eliminates the
possibility of deadlock caused by Writes on different copies of one data
item. Suppose x has copies $x_1$ and $x_2$. Suppose $T_1$ and $T_2$ want
to Write x at about the same time. In the do nothing approach, the
following execution is possible: $T_1$ locks $x_1$; $T_2$ locks $x_2$; $T_1$ tries
to lock $x_2$ but is blocked by $T_2$'s lock; $T_2$ tries to lock $x_1$ but is
blocked by $T_1$'s lock. This is a deadlock. Primary copy avoids this
possibility because each transaction must lock the primary copy first.

[Figure 7. Processing Writes in Primary Copy. Columns: Transaction (Begin,
Write(x), End), Scheduler, DM. Note: $x_1$ is primary copy.]

In the voting approach to replication, TM's again distribute Writes
to all copies of each data item [Thom]. Assume we are using distributed
schedulers. When a scheduler is ready to output $w_i[x_j]$, it sends a vote
of yes to the vote collector for x; it does not output $w_i[x_j]$ at this
time. When the vote collector receives yes votes from a majority of
schedulers, it tells all schedulers to output their Writes. (Each scheduler
may need to update its local data structures before outputting $w_i[x_j]$, e.g.
set a write-lock on $x_j$.) Assume each scheduler is correct (i.e., produces
an acyclic SG). Then, since every pair of conflicting operations was voted
yes by some correct scheduler (both operations got a majority of yes's), the
SG must be acyclic and the result is correct.

The principal benefit of voting is fault tolerance; it works correctly
as long as a majority of sites holding a copy of x are running. See
[Thom, Giff] for details.
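
A vote collector for one data item can be sketched as below. The class VoteCollector and its methods are hypothetical names, and the sketch only counts yes votes; message passing, failures, and the schedulers' own tests are outside its scope.

```python
class VoteCollector:
    """Counts yes votes for Writes on one logical data item (illustrative)."""

    def __init__(self, copies):
        self.copies = set(copies)
        self.yes = {}  # txn -> copies whose scheduler has voted yes

    def vote_yes(self, txn, copy):
        """Record a yes vote; return True once a majority has been reached."""
        if copy not in self.copies:
            raise ValueError(f"unknown copy: {copy}")
        self.yes.setdefault(txn, set()).add(copy)
        return self.has_majority(txn)

    def has_majority(self, txn):
        return 2 * len(self.yes.get(txn, ())) > len(self.copies)
```

Since any two majorities of the same set of copies share at least one copy, two conflicting Writes are always both seen by at least one scheduler.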


8. MULTIVERSION DATA


Let  us  return  to  a  database  system  model  where  each  logical  data 
item  is  stored  at  one  DM. 

In a multiversion database each Write, $w_i[x]$, produces a new copy (or
version) of x, denoted $x^i$. Thus, the value of x is a set of versions.
For each Read, $r_i[x]$, the scheduler selects one of the versions of x to
be read. Since Writes don't overwrite each other, and since Reads can read
any version, the scheduler has more flexibility in controlling the effective
order of Reads and Writes.

Although the database has multiple versions, users expect their
transactions to behave as if there were just one copy of each data item. Serial
logs don't always behave this way. For example,

$w_0[x^0]\; r_1[x^0]\; w_1[x^1, y^1]\; r_2[x^0, y^1]\; w_2[y^2]$

is a serial log, but its behavior cannot be reproduced with only one copy
of x. We must therefore restrict the set of allowable serial logs.

A serial log is 1-copy serial (or 1-serial) if each $r_i[x^j]$ reads
from the last transaction preceding it that wrote into any version of x.
The above log is not 1-serial, because $r_2$ reads x from $w_0$, but
$w_0[x^0] < w_1[x^1] < r_2[x^0]$. A log is 1-serializable (1-SR) if it's equivalent
to a 1-serial log. 1-serializability is our correctness criterion for
multiversion database systems.
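
The 1-serial test is easy to run on the example log. In this sketch, is_one_serial is a hypothetical helper, the input is assumed to be a serial log already, and each operation is a tuple (kind, txn, item, version) where a version is named by the transaction that wrote it.

```python
def is_one_serial(log):
    """Check that every Read reads the version written by the latest preceding Write."""
    last_writer = {}
    for kind, txn, item, version in log:
        if kind == "w":
            last_writer[item] = version
        elif last_writer.get(item) != version:
            return False
    return True


# The example log: r_2 reads x^0 although w_1[x^1] precedes it.
example = [("w", 0, "x", 0), ("r", 1, "x", 0),
           ("w", 1, "x", 1), ("w", 1, "y", 1),
           ("r", 2, "x", 0), ("r", 2, "y", 1),
           ("w", 2, "y", 2)]

# The same log with r_2 reading x^1 instead.
repaired = [op if op != ("r", 2, "x", 0) else ("r", 2, "x", 1) for op in example]
```

The first log fails the test at the Read of x by transaction 2; the repaired log passes.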

All  multiversion  concurrency  control  algorithms  (that  we  know  of) 
totally  order  the  versions  of  each  data  item  in  some  simple  way.  A 
version order, $\ll$, for L is an order relation over versions such that,
for each x, $\ll$ totally orders the versions of x.

Given a version order $\ll$, we define the multiversion SG w.r.t. L and
$\ll$ (denoted MVSG(L, $\ll$)) as SG(L) with the following edges added:

for each $r_k[x^j]$ and $w_i[x^i]$ in L, if $x^i \ll x^j$ then include
$T_i \to T_j$, else include $T_k \to T_i$.

MULTIVERSION THEOREM [BG3]. A multiversion log is 1-SR iff there
exists a version order $\ll$ such that MVSG(L, $\ll$) is acyclic. □

This theorem enables us to prove multiversion concurrency control
algorithms to be correct. We must argue that for every log L produced by
the algorithm, MVSG(L, $\ll$) is acyclic for some $\ll$.

The  types  of  multiversion  schedulers  that  have  been  proposed  fall 
into  two  classes  that  approximately  correspond  to  timestamping  and  locking. 

8.1  Multiversion  Timestamping 

Multiversion concurrency control was first introduced by Reed in his
multiversion timestamping method [Reed]. In Reed's algorithm, each
transaction has a unique timestamp. Each Read and Write carries the timestamp
of the transaction that issued it, and each version carries the timestamp
of the transaction that wrote it. The version order is defined by $x^i \ll x^j$
if TS(i) < TS(j).

Operations are processed first-come-first-served.** However, the version
selection rules ensure that the overall effect is as if operations were
processed in timestamp order. To process $r_i[x]$, the scheduler (or DM)
returns the version of x with largest timestamp < TS(i). To process
$w_i[x]$, version $x^i$ is created, unless some $r_k[x^j]$ has already been
processed with TS(j) < TS(i) < TS(k). If this condition holds, the Write
is rejected.

An analysis of MVSG(L, $\ll$) for any L produced by this method shows
that every edge $T_i \to T_j$ is in timestamp order (TS(i) < TS(j)). Thus
MVSG(L, $\ll$) is acyclic, and so L is 1-SR.

* Note that two operations conflict (and produce an edge in SG(L)) if they
operate on the same version and one of them is a write.

** Handshaking is used to ensure that logically conflicting operations are
executed by DM's in the order the scheduler output them.
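
The Read and Write rules of multiversion timestamping can be sketched for a single data item. The class MVTimestampItem is a hypothetical name; the sketch assumes an initial version with timestamp 0 and positive transaction timestamps, and identifies each version by the timestamp of its writer.

```python
class MVTimestampItem:
    """Versions and processed Reads of one data item (illustrative sketch)."""

    def __init__(self, initial_ts=0):
        self.versions = [initial_ts]  # write timestamps of existing versions
        self.reads = []               # (reader_ts, version_ts) already processed

    def read(self, ts):
        """Return the version with the largest timestamp below ts."""
        version = max(v for v in self.versions if v < ts)
        self.reads.append((ts, version))
        return version

    def write(self, ts):
        """Create version ts unless some Read with TS(j) < ts < TS(k) was processed."""
        if any(vj < ts < rk for rk, vj in self.reads):
            return False              # Write rejected
        self.versions.append(ts)
        return True
```

Note that a tardy Read is never rejected here: it is simply given an older version, and only Writes that would invalidate a processed Read are refused.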


8.2  Multiversion  Locking 

In multiversion locking, the Writes on each data item, x, must be
ordered. We define $x^i \ll x^j$ if $w_i[x^i] < w_j[x^j]$. Each version is in the
certified or uncertified state. When a version is first written, it is
uncertified. Each Read, $r_i[x]$, reads either the last (wrt $\ll$) certified
version of x or any uncertified version of x. When a transaction
finishes executing, the database system attempts to certify it. To certify
$T_i$, three conditions must hold:

C1. For each $r_i[x^j]$, $x^j$ is certified.

C2. For each $w_i[x^i]$, all $x^j \ll x^i$ are certified.

C3. For each $w_i[x^i]$ and each $x^j \ll x^i$, all transactions that read
$x^j$ have been certified.

These conditions must be tested atomically. When they hold, $T_i$ is declared
to be certified and all versions it wrote are (atomically) certified.

An analysis of MVSG(L, $\ll$) for any L produced by this method shows
that every edge $T_i \to T_j$ is consistent with the order in which transactions
were certified. Since certification is an atomic event, the certification
order is a total order. Thus, MVSG(L, $\ll$) is acyclic, and so L is
1-SR.


Two details of the algorithm require some discussion. First, the
algorithm can deadlock. For example, in this log


$w_0[x^0]\; r_1[x^0]\; r_2[x^0]\; w_1[x^1]\; w_2[x^2]$


$T_1$ and $T_2$ are deadlocked due to certification condition C3. As in 2PL,
deadlocks can be detected by cycle detection on a waits-for graph whose
edges include $T_i \to T_j$ such that $T_i$ is waiting for $T_j$ to become
certified (so that $T_i$ will satisfy C1-C3).
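
The deadlock in this log can be reproduced by computing, for each transaction, which uncertified transactions its certification is waiting for. The function blockers and its data layout are hypothetical; a version is named by the transaction that wrote it and is treated as certified exactly when that transaction is, and a transaction reading its own version is assumed not to block itself.

```python
def blockers(txn, reads, writes, order, certified):
    """Return the uncertified transactions that txn's certification waits for.

    reads:  txn -> set of (item, version) pairs it read
    writes: txn -> set of items it wrote
    order:  item -> list of versions in version order
    """
    wait = set()
    for item, ver in reads.get(txn, ()):
        if ver != txn and ver not in certified:
            wait.add(ver)                                  # C1
    for item in writes.get(txn, ()):
        versions = order[item]
        for ver in versions[:versions.index(txn)]:
            if ver not in certified:
                wait.add(ver)                              # C2
            for reader, rset in reads.items():
                if (reader != txn and (item, ver) in rset
                        and reader not in certified):
                    wait.add(reader)                       # C3
    return wait


reads = {1: {("x", 0)}, 2: {("x", 0)}}
writes = {0: {"x"}, 1: {"x"}, 2: {"x"}}
order = {"x": [0, 1, 2]}
certified = {0}
```

With these inputs, transaction 1 waits for transaction 2 and transaction 2 waits for transaction 1, which is the cycle a waits-for graph would reveal.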

Second, C1-C3 can be tested atomically without using a critical
section. Once C1 or C2 is satisfied for some $r_i[x^j]$ or $w_i[x^i]$, no
future event can falsify it. When C3 becomes true for some $w_i[x^i]$, we
"lock" $x^i$ so that no future reads can read versions that precede $x^i$.
This allows C1-C3 to be checked one data item at a time. Of course, the
waits-for graph must be extended to account for these new version locks.


Two similar multiversion locking algorithms have been proposed which
allow at most one uncertified version of each data item. In Stearns'
and Rosenkrantz's method [SR], the waits-for graph is avoided by using a
timestamp-based deadlock avoidance scheme. In Bayer et al.'s method
[BHR, BEHR], a waits-for graph is used to help prevent deadlocks. This
algorithm consults the waits-for graph before selecting a version to read,
and always selects a version that creates no cycles.

Multiversion locking algorithms in which queries (read-only transactions)
are given special treatment are described in [Dubo], [BG4].

9. COMBINING THE TECHNIQUES


The techniques described in Sections 4-8 can be combined in almost
all possible ways. The three basic scheduling techniques (2PL, T/O, SG
checking) can be used in scheduler mode or certifier mode. This gives
six basic concurrency control techniques. Each technique can be used for
rw or ww scheduling or both ($6^2 = 36$). Schedulers can be centralized or
distributed ($36 \times 2 = 72$), and replicated data can be handled in three ways
(Do Nothing, Primary Copy, Voting) ($72 \times 3 = 216$). Then, one can use
multiversions or not ($216 \times 2 = 432$). By considering the multivarious variations
of each technique, the number of distinct algorithms is in the thousands.
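
The count can be verified by enumerating the combinations directly. The variable names below are illustrative only; the option lists are the ones named in the paragraph above.

```python
from itertools import product

techniques = ["2PL", "T/O", "SG checking"]
modes = ["scheduler", "certifier"]
basic = list(product(techniques, modes))      # six basic techniques

combinations = list(product(
    basic,                                    # choice for rw scheduling
    basic,                                    # choice for ww scheduling
    ["centralized", "distributed"],           # scheduler location
    ["do nothing", "primary copy", "voting"], # replication
    [False, True],                            # multiversions or not
))
```

Each element of combinations is one point in the design space; variations within a technique (for example conservative versus basic T/O) are not enumerated here.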

To  illustrate  our  framework,  we  describe  some  of  these  algorithms 
that  have  already  appeared  in  the  literature. 

The distributed locking algorithm proposed for System R*
uses a 2PL scheduler for rw and ww synchronization. The schedulers
are distributed at the DM's. Replication is handled by the do nothing
approach.

Distributed  INGRES  uses  a  similar  locking  algorithm  [Ston] .  The  main 
difference  is  that  distributed  INGRES  uses  primary  copy  for  replication. 

Many researchers have proposed algorithms that use conservative T/O
for all scheduling [SM, Lela, KNTH, CB]. They typically distribute the
schedulers at DM's and take the do nothing approach to replication.

SDD-1 uses conservative T/O for rw scheduling and Thomas' write
rule  for  ww  scheduling.  The  algorithm  has  distributed  schedulers  and 
takes  the  do  nothing  approach  to  replication  [BSR] .  SDD-1  also  uses 
conflict  graph  analysis,  a  technique  for  preanalyzing  transactions  to 
determine  which  run-time  conflicts  need  not  be  synchronized. 


A method using 2PL for rw scheduling and Thomas' write rule for
ww scheduling is described in [BGL]. Distributed schedulers and the do
nothing approach to replication were suggested. To ensure that the locking
order is consistent with the timestamp order, one can use a Lamport clock.
Each message is timestamped with the local clock time when it was sent;
if a site receives a message with a timestamp, TS, greater than its local
clock time, the site pushes its clock ahead to TS. After a transaction
obtains all of its locks, it is assigned a timestamp using the TM's local
Lamport clock.
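
The clock rule just described can be sketched as follows. The class LamportClock and its methods are hypothetical names, and advancing the clock by one on every local event and every send is an assumption of this sketch; the text only specifies the push-ahead rule on receipt.

```python
class LamportClock:
    """A per-site logical clock (illustrative sketch)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock for a local event (assumed step of 1)."""
        self.time += 1
        return self.time

    def send(self):
        """Return the timestamp to attach to an outgoing message."""
        return self.tick()

    def receive(self, ts):
        """Push the clock ahead to ts if the message is from the 'future'."""
        if ts > self.time:
            self.time = ts
        return self.time
```

After a site receives a message, any timestamp it assigns next is larger than the message's timestamp, which is what keeps the locking order and the timestamp order consistent.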

Thomas'  majority  consensus  algorithm  was  one  of  the  first  distributed 
concurrency  control  algorithms.  It  uses  a  2PL  certifier  for  rw  scheduling 
and  Thomas' write  rule  for  ww  scheduling.  Schedulers  are  distributed  and 
voting  is  used  for  replication.  Each  transaction  is  assigned  a  timestamp 
from  a  Lamport  clock  when  it  is  certified.  This  ensures  that  the 
certification  order  (produced  by  rw  scheduling)  is  consistent  with  the 
timestamp  order  used  for  ww  scheduling. 

Each of these algorithms is quite complex. A complete treatment of
each  would  be  lengthy.  Yet  by  understanding  the  basic  techniques  and  how 
they  can  be  correctly  combined,  we  can  explain  the  essentials  of  each 
algorithm  in  a  few  sentences. 


10. PERFORMANCE

Given  that  thousands  of  concurrency  control  algorithms  are  conceivable, 
which  one  is  best  for  each  type  of  application?  Every  concurrency  control 
algorithm  delays  and/or  aborts  some  transactions,  when  conflicting  operations 
are  submitted  concurrently.  The  question  is:  which  algorithms  increase 
overall  transaction  response  time  the  least? 

Although  there  have  been  several  performance  studies  of  some  of  these 
algorithms,  the  results  are  still  inconclusive  [GS,  GM1,  GM2 ,  Lin,  LN] . 

There  is  some  evidence  that  2PL  schedulers  perform  well  at  low  to  moderate 
intensity  of  conflicting  operations.  However,  we  know  of  no  quantitative 
results  that  tell  when  2PL  thrashes  due  to  too  many  deadlocks.  There  are 
similar  gaps  in  our  understanding  of  the  performance  of  other  types  of 
schedulers.  More  analysis  is  badly  needed  to  help  us  learn  how  to  predict 
which  concurrency  control  algorithms  will  perform  well  for  the  applications 
and  systems  we  will  encounter  in  practice. 

ABSTRACT


Concurrency  control  is  the  activity  of  synchronizing  operations 
issued  by  concurrently  executing  programs  on  a  shared  database.  The  goal 
is  to  produce  an  execution  that  has  the  same  effect  as  a  serial 
(noninterleaved)  one. 

In  a  multiversion  database  system,  each  write  on  a  data  item  produces 
a  new  copy  (or  version)  of  that  data  item.  This  paper  presents  a  theory  for 
analyzing  the  correctness  of  concurrency  control  algorithms  for  multiversion 
database systems. We use the theory to analyze some new algorithms and
some  previously  published  ones. 


1. INTRODUCTION


A  database  system  (dbs)  is  a  process  that  executes  read  and  write 
operations  on  data  items  of  a  database.  A  transaction  is  a  program  that 
issues  reads  and  writes  to  a  dbs.  When  transactions  execute  concurrently, 
the  interleaved  execution  of  their  reads  and  writes  by  the  dbs  can  produce 
undesirable  results.  Concurrency  control  is  the  activity  of  avoiding  such 
undesirable  results.  Specifically,  the  goal  of  concurrency  control  is  to 
produce  an  execution  that  has  the  same  effect  as  a  serial  (noninterleaved) 
one.  Such  executions  are  called  serializable. 

A  dbs  attains  a  serializable  execution  by  controlling  the  order  in 
which  reads  and  writes  are  executed.  When  an  operation  is  submitted  to 
the  dbs,  the  dbs  can  either  execute  the  operation  immediately,  delay  the 
operation  for  later  processing,  or  reject  the  operation.  If  an  operation 
is rejected, then the transaction that issued the operation is aborted,
meaning  that  all  of  the  transaction's  writes  are  undone,  and  transactions 
that  read  any  of  the  values  produced  by  those  writes  are  also  aborted. 

The  principal  reason  for  rejecting  an  operation  is  that  it  arrived 
"too  late".  For  example,  a  read  is  normally  rejected  because  the  value 
it  was  supposed  to  read  has  already  been  overwritten.  Such  rejections  can 
be  avoided  by  keeping  old  copies  of  each  data  item.  Then  a  tardy  read 
can  be  given  an  old  value  of  a  data  item,  even  though  it  was  "overwritten". 

In  a  multiversion  dbs,  each  write  on  a  data  item  x,  say,  produces  a 
new  copy  (or  version)  of  x.  For  each  read  on  x,  the  dbs  selects  one  of 
the  versions  of  x  to  be  read.  Since  writes  do  not  overwrite  each  other, 
and  since  reads  can  read  any  version,  the  dbs  has  more  flexibility  in 
controlling  the  order  of  reads  and  writes.  Several  interesting  concurrency 
control algorithms that exploit multiversions have been proposed [BEHR, BHR,
CFLNR, Dubo, Reed, Silb, SLR, SR]. Theoretical work on this problem includes
[PK, SLR].

This paper presents a theory for analyzing the correctness of concurrency
control algorithms for multiversion dbs's. We present some new
multiversion algorithms. We use the theory to analyze the new algorithms
and several previously published ones.

Section  2  reviews  concurrency  control  theory  for  non-multiversion 
databases.  Section  3  extends  the  theory  to  multiversion  databases. 

Sections 4-6 use the theory to analyze multiversion concurrency control
algorithms.


2. BASIC SERIALIZABILITY THEORY


The standard theory for analyzing database concurrency control algorithms
is serializability theory [BSW, Casa, EGLT, Papa, PBR, SLR]. Serializability
theory is a method for analyzing executions allowed by the concurrency
control  algorithm.  The  theory  gives  a  precise  condition  under  which  an 
execution  is  correct.  A  concurrency  control  algorithm  is  then  judged  to  be 
correct  if  all  of  its  executions  are  correct. 

This  section  reviews  serializability  theory  for  concurrency  control 
without  multiversions. 

2.1 System Model

We  assume  the  dbs  is  distributed  and  use  Lamport's  model  of  distributed 
executions  [Lamp].  The  system  consists  of  a  collection  of  processes  that 
communicate  by  passing  messages.  The  model  describes  an  execution  in  terms 
of  a  happens  before  relation  that  tells  the  order  in  which  events  occur.  An 
event  is  one  of  the  following:  the  execution  of  an  operation  by  a  process, 
the  sending  of  a  message,  or  the  receipt  of  a  message. 

Within  a  process,  the  happens  before  relation  is  any  partial  order 
over  the  process's  events.  For  the  system,  the  happens  before  relation 
(denoted  <)  is  the  smallest  partial  order  over  all  events  in  the  system 
such  that:  (1)  If  e  and  f  are  events  in  process  P,  and  e  happens 
before  f  in  P,  then  e<f.  (2)  If  e  is  the  event  "Process  P  sends 
message M" and f is the event "Process Q receives M", then e < f.
Condition  (1)  states  that  <  must  be  consistent  with  the  order  of  events 
within  each  process.  Condition  (2)  states  that  a  message  must  be  sent 


before it is received. And, since < is the smallest partial order satisfying
these conditions, condition (2) is the only way that events in
different  processes  can  be  ordered. 
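
As an illustration only, the following Python sketch computes the smallest
transitive relation that contains the within-process orderings (condition 1)
and the send/receive pairs (condition 2). The function name happens_before
and the event labels are assumptions made for this sketch; they are not
taken from [Lamp].

```python
from itertools import product

def happens_before(process_orders, messages):
    """Smallest transitive relation containing both kinds of pairs.

    process_orders: (e, f) pairs with e before f inside one process (condition 1).
    messages: (send_event, receive_event) pairs (condition 2).
    """
    hb = set(process_orders) | set(messages)
    changed = True
    while changed:  # naive transitive closure; adequate for small examples
        changed = False
        for (a, b), (c, d) in product(list(hb), repeat=2):
            if b == c and (a, d) not in hb:
                hb.add((a, d))
                changed = True
    return hb

# Hypothetical events: a transaction process T asks a data-item process X
# for a value and receives the reply.
process_orders = [("T.send_request", "T.receive_value"),
                  ("X.receive_request", "X.read"),
                  ("X.read", "X.send_value")]
messages = [("T.send_request", "X.receive_request"),
            ("X.send_value", "T.receive_value")]
hb = happens_before(process_orders, messages)
print(("X.read", "T.receive_value") in hb)
```

Events in different processes become ordered here only through the two
message pairs, which is what condition (2) asserts.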

This paper deals at a higher level of abstraction. Hereafter, we will
not explicitly mention processes and messages (except briefly in Section 6).
For concreteness, the reader may assume that each transaction is a process,
and each data item is managed by a separate process. (Our results don't
depend on these assumptions.) Under these assumptions each database operation
entails two message exchanges. For transaction T_i to read x, T_i must send
a message to x's process; to return x's value, the x process must send a
message to T_i. The same message pattern is needed for writes; in this case,
the return message just acknowledges that the write has been done. Also
under these assumptions, any decision or event ordering involving one data
item is a local activity; decisions or orderings involving multiple data items
are distributed activities. The abstraction that we use hides message
exchanges and related issues, allowing us to reason about concurrency control
at a higher level.

2.2  Logs 

Serializability  theory  models  executions  by  logs.  A  log  identifies  the 
Read  and  Write  operations  executed  on  behalf  of  each  transaction,  and  tells 
the  order  in  which  those  operations  were  executed.  A  log  is  an  abstraction 
of  Lamport's  happens  before  relation. 

A transaction log represents an allowable execution of a single
transaction. Formally, a transaction log is a partially ordered set (poset)
T_i = (Σ_i, <_i), where Σ_i is the set of Reads and Writes issued by (an
execution of) transaction i, and <_i tells the order in which those
operations must be executed. We write transaction logs as diagrams.

       r_i[x] ↘
T_i =            w_i[x]
       r_i[z] ↗


represents  a  transaction  that  reads  x  and  z  in  parallel,  and  then 
writes  x.  (Presumably,  the  value  written  depends  on  the  values  read.) 

We use r_i[x] (resp., w_i[x]) to denote a Read (resp., Write) on x
issued by T_i. To keep this notation unambiguous, we assume that no
transaction reads or writes a data item more than once. None of our results
depend on this assumption.

Let T = {T_0, ..., T_n} be a set of transaction logs. A dbs log (or
simply a log) over T represents an execution of T_0, ..., T_n. Formally,
a log over T is a poset L = (Σ, <) where

1. Σ = ∪_{i=0}^{n} Σ_i;

2. < ⊇ ∪_{i=0}^{n} <_i;

3. every r_j[x] is preceded by at least one w_i[x] (i = j is possible),
   where "w_i[x] precedes r_j[x]" is synonymous with w_i[x] < r_j[x]; and

4. all pairs of conflicting operations are < related (two operations
   conflict if they operate on the same data item, and at least one is
   a Write).


Condition (1) states that the dbs executed all, and only, those operations
submitted by T_0, ..., T_n. Condition (2) states that the dbs honored all
operation orderings stipulated by the transactions. Condition (3) states
that no transaction can read a data item until some transaction has written
its initial value. Condition (4) states that the dbs executes conflicting
operations sequentially. E.g., if T_i reads x and T_j writes x, then
r_i[x] happens before w_j[x] or vice versa; the operations cannot occur at
the same time.
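
Conditions (3) and (4) can be checked mechanically. The sketch below is a
minimal Python illustration, assuming an operation is encoded as a tuple
(kind, transaction index, data item) and the order < is given as a
transitively closed set of pairs. The encoding and all function names are
assumptions of this sketch.

```python
def conflict(a, b):
    """Operations are (kind, txn, item); they conflict if they touch the
    same item and at least one of them is a Write."""
    return a != b and a[2] == b[2] and "w" in (a[0], b[0])

def total_order(sequence):
    """All (earlier, later) pairs of a log written as a sequence."""
    return {(a, b) for n, a in enumerate(sequence) for b in sequence[n + 1:]}

def meets_conditions_3_and_4(ops, order):
    """ops: set of operations; order: transitively closed set of (a, b), a < b."""
    for a in ops:
        if a[0] == "r" and not any(
                b[0] == "w" and b[2] == a[2] and (b, a) in order for b in ops):
            return False  # condition 3: no write precedes this read
        for b in ops:
            if conflict(a, b) and (a, b) not in order and (b, a) not in order:
                return False  # condition 4: a conflicting pair is unordered
    return True

w0x, r1x, w1x = ("w", 0, "x"), ("r", 1, "x"), ("w", 1, "x")
good = meets_conditions_3_and_4({w0x, r1x, w1x}, total_order([w0x, r1x, w1x]))
bad = meets_conditions_3_and_4({w0x, r1x, w1x}, {(r1x, w1x)})
print(good, bad)
```

In the second call only T_1's own ordering is supplied, so the read of x is
not preceded by any write and the check fails.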

Consider the following transaction logs

         w_0[x]            r_1[x] ↘
  T_0 =  w_0[y]     T_1 =            w_1[x]
         w_0[z]            r_1[z] ↗


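The following are some of the possible logs over {T_0, T_1}.

[Diagrams of three logs, (1), (2), and (3), over {T_0, T_1}. Each contains
the operations w_0[x], w_0[y], w_0[z], r_1[x], r_1[z], and w_1[x]. In every
log, w_0[x] < r_1[x] < w_1[x] and w_0[z] < r_1[z] < w_1[x]; the remaining
arrows are not recoverable from the source.]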

Note that orderings implied by transitivity are usually not drawn. E.g.
w_0[x] < w_1[x] is not drawn in the diagrams, although it follows from
w_0[x] < r_1[x] < w_1[x].

Notice that the dbs is allowed to process Read(x) and Read(z)
sequentially (cf. (1) and (2)), even though T_1 allows them to run in
parallel. However, the dbs is not allowed to reverse or eliminate any
ordering stipulated by T_1.

Given transaction logs

[Diagrams of transaction logs T_0, ..., T_4: T_0 writes x, y, and z;
T_1 reads x and z, then writes x; T_2 reads x and writes y; T_3 reads z
and writes y and z; T_4 reads x, y, and z.]

the following is a log over {T_0, T_1, T_2, T_3, T_4}.

[Diagram of log L_1; not recoverable from the source.]

The following is another log over the same transactions.

L_2 = w_0[x] w_0[y] w_0[z] r_2[x] w_2[y] r_1[x] r_1[z] w_1[x] r_3[z] w_3[y]
      w_3[z] r_4[x] r_4[y] r_4[z]

When we write a log as a sequence, e.g., L_2, we mean that the log is
totally ordered: each operation precedes the next one and all subsequent
ones in the sequence. Thus, in L_2, w_0[x] < w_0[y] < w_0[z] < r_2[x] ... .


2.3 Log Equivalence


Intuitively, two logs are equivalent if each transaction performs the
same computation in both logs. We formalize log equivalence in terms of
information flow between transactions.

Let L be a log over {T_0, ..., T_n}. Transaction T_j reads-x-from
T_i in L if (1) w_i[x] and r_j[x] are operations in L;
(2) w_i[x] < r_j[x]; and (3) no w_k[x] falls between these operations.

Two logs over {T_0, ..., T_n} are equivalent, denoted ≡, if they have the
same reads-from's; i.e., for all i, j, and x, T_j reads-x-from T_i in one
log iff this condition holds in the other. This definition ensures that
each transaction reads the same values from the database in both logs.

Consider logs L_1 and L_2 of the previous section. These logs
have the same reads-from's:

  T_1 reads-x-from T_0, T_1 reads-z-from T_0
  T_2 reads-x-from T_0
  T_3 reads-z-from T_0
  T_4 reads-x-from T_1, T_4 reads-y-from T_3, T_4 reads-z-from T_3.

Therefore, L_1 ≡ L_2.
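
For totally ordered logs, the reads-from relation and the equivalence test
can be sketched in a few lines of Python. This is an illustration under
stated assumptions: logs are written as sequences, operations are encoded as
(kind, transaction index, data item), and the helper names parse,
reads_from, and equivalent are hypothetical. The second log below is an
interleaving chosen for the example; it is not one of the paper's logs.

```python
def parse(text):
    """Parse 'w0[x] r1[x]' into [('w', 0, 'x'), ('r', 1, 'x')]."""
    ops = []
    for tok in text.split():
        left = tok.index("[")
        ops.append((tok[0], int(tok[1:left]), tok[left + 1:-1]))
    return ops

def reads_from(log):
    """Set of (reader, item, writer) triples of a totally ordered log."""
    last_writer, result = {}, set()
    for kind, txn, item in log:
        if kind == "w":
            last_writer[item] = txn
        else:
            # In a sequence, "no w_k[x] falls between" means the most
            # recent write on the item; None if nothing was written yet.
            result.add((txn, item, last_writer.get(item)))
    return result

def equivalent(log_a, log_b):
    return reads_from(log_a) == reads_from(log_b)

L2 = parse("w0[x] w0[y] w0[z] r2[x] w2[y] r1[x] r1[z] w1[x] "
           "r3[z] w3[y] w3[z] r4[x] r4[y] r4[z]")
interleaved = parse("w0[x] w0[y] w0[z] r1[x] r2[x] r1[z] w2[y] w1[x] "
                    "r3[z] w3[y] w3[z] r4[x] r4[y] r4[z]")
print(sorted(reads_from(L2)))
print(equivalent(L2, interleaved))
```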

This definition of log equivalence ignores the final database state
produced by the logs. For example, these logs are equivalent

  L  = w_0[x] w_1[x]
  L' = w_1[x] w_0[x]

even though different transactions produce the final value of x in each
log. It is often desirable to strengthen the notion of equivalence by
insisting that for each x, the same transaction writes the final value of x
in both logs. This can be modelled by (i) adding a "final transaction" that
follows all other transactions and reads the entire database (e.g., T_4 in
logs L_1 and L_2); and (ii) redefining equivalence to be that the logs have the
same reads-from's and the same final transaction.

2.4  Serializable  Logs 


A serial log is a totally ordered log on Σ such that for every pair of
transactions T_i and T_j, either all of T_i's operations precede all of
T_j's, or vice versa (e.g., L_2 in Section 2.2). A serial log represents
an execution in which there is no concurrency whatsoever; each transaction
executes from beginning to end before the next transaction begins.
From  the  point  of  view  of  concurrency  control,  therefore,  every  serial  log 
represents  an  obviously  correct  execution. 

What other logs represent correct executions? From the point of view
of  concurrency  control,  a  correct  execution  is  one  in  which  concurrency 
is  invisible.  That  is,  an  execution  is  correct  if  it  is  equivalent  to  an 
execution  in  which  there  is  no  concurrency.  Serial  logs  represent  the 
latter  executions,  and  so  a  correct  log  is  any  log  equivalent  to  a  serial 
log.  Such  logs  are  termed  serializable  (SR). 

Log L_1 of Sec. 2.2 is SR, because it is equivalent to serial log L_2,
as shown in Sec. 2.3. Therefore L_1 is a correct log.
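
The definition of SR suggests a direct, if expensive, test: try every serial
arrangement of the transactions and compare reads-from's. The Python sketch
below does this for logs written as sequences. It is an assumption-laden
illustration (tuple encoding, hypothetical function names), and it examines
n! serial orders, so it is only practical for tiny examples.

```python
from itertools import permutations

def reads_from(log):
    """Set of (reader, item, writer) triples of a totally ordered log."""
    last_writer, result = {}, set()
    for kind, txn, item in log:
        if kind == "w":
            last_writer[item] = txn
        else:
            result.add((txn, item, last_writer.get(item)))  # None: no write yet
    return result

def is_serializable(log):
    """True if some serial log over the same transactions is equivalent."""
    txns = sorted({txn for _, txn, _ in log})
    target = reads_from(log)
    for perm in permutations(txns):
        # Serial log: each transaction's operations, kept in their log order.
        serial = [op for t in perm for op in log if op[1] == t]
        if reads_from(serial) == target:
            return True
    return False

# Hypothetical logs: the first interleaves T_1 and T_2 harmlessly, the
# second lets both read x from T_0 before either writes it.
harmless = [("w", 0, "x"), ("w", 0, "y"), ("r", 1, "x"), ("r", 2, "y"),
            ("w", 1, "x"), ("w", 2, "y")]
lost_update = [("w", 0, "x"), ("r", 1, "x"), ("r", 2, "x"),
               ("w", 1, "x"), ("w", 2, "x")]
print(is_serializable(harmless), is_serializable(lost_update))
```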


3. MULTIVERSION SERIALIZABILITY THEORY


In a multiversion dbs, each write produces a new version. We denote
versions of x by x_i, x_j, ..., where the subscript is the index of the
transaction that wrote the version. Operations on versions are denoted
r_i[x_j] and w_i[x_i].

3.1  Multiversion  Logs 

Let T = {T_0, ..., T_n} be a set of transaction logs (defined exactly
as in Section 2.2, i.e., the operations reference data items). To
execute T, a multiversion dbs must translate T's "data item operations"
into "version operations". We formalize this translation by a function h
which maps each w_i[x] into w_i[x_i], and each r_i[x] into r_i[x_j] for
some j.

A multiversion dbs log (or simply mv log) over T is a poset
L = (Σ, <) where

1. Σ = h(∪_{i=0}^{n} Σ_i), for some translation function h,

2. for each T_i, and all operations op_i and op_i', if op_i <_i op_i'
   then h(op_i) < h(op_i'), and

3. if h(r_j[x]) = r_j[x_i], then w_i[x_i] < r_j[x_i].

Condition  (1)  states  that  each  operation  submitted  by  a  transaction  is 
translated  into  an  appropriate  multiversion  operation.  Condition  (2) 
states  that  the  mv  log  preserves  all  orderings  stipulated  by  transactions. 
Condition  (3)  states  that  a  transaction  may  not  read  a  version  until  it's 
been  produced. 

The following is an mv log over {T_0, T_1, T_2, T_3, T_4} of Section 2.

[Diagram of mv log L_3; not recoverable from the source.]

All mv logs over a set T have the same write operations, since
h(w_i[x]) = w_i[x_i]. But they needn't have the same reads. For example,
L_4 has r_4[y_2] instead of r_4[y_3].

[Diagram of mv log L_4; not recoverable from the source.]


3.2  MV  Log  Equivalence 

Most  definitions  and  results  from  basic  serializability  theory  extend 
to  mv  logs:  we  simply  replace  the  notion  of  "data  item"  by  "version"  in 
those  definitions  and  results.  However,  the  structure  of  mv  logs  simplifies 
the  treatment.  This  section  redoes  the  material  of  Sections  2.3  and  2.4 
for  mv  logs. 

Let L be an mv log over {T_0, ..., T_n}. Transaction T_j reads-x-from
T_i in L if T_j reads the version of x produced by T_i. By definition,
the version of x produced by T_i is x_i. So, T_j reads-x-from T_i iff
T_j reads x_i. This means that the reads-from's in L are determined by
the translation function h, viz. the way h translates "data item reads"
into "version reads".

Two mv logs over {T_0, ..., T_n} are equivalent, denoted ≡, if they have
the same reads-from's. The reads-from's in an mv log are determined by its read
operations: T_j reads-x-from T_i iff r_j[x_i] is an operation of the log.
So, two logs are equivalent iff they have the same read operations. Moreover,
since all mv logs over the same transactions have the same writes, equivalence
reduces to a trivial condition.


FACT 1. Two mv logs over a set of transactions T are equivalent iff the
logs have the same operations.


Two "version operations" conflict if they operate on the same version
and one is a write. Only one pattern of conflict is possible in an mv log:
If op_i < op_j and these operations conflict, then op_i is w_i[x_i] and op_j
is r_j[x_i]. Conflicts of the form w_i[x_i] < w_j[x_i] are impossible, because
each write produces a new version. Conflicts of the form r_j[x_i] < w_i[x_i]
are impossible since T_j can't read x_i until it's been produced. Thus
all conflicts in an mv log correspond to reads-from's.

The serialization graph for an mv log is defined as for a regular log.
Since conflicts are so structured in an mv log, serialization graphs are
quite simple. Let L be an mv log over {T_0, ..., T_n}. SG(L) has nodes
T_0, ..., T_n and edges T_i → T_j (i ≠ j) such that for some x, T_j reads-x-from T_i.
That is, T_i → T_j is present iff for some x, r_j[x_i] is an operation of L.
This gives us the following.


FACT 2. Let L and L' be mv logs over T.

(i) If L and L' have the same operations, then SG(L) = SG(L').

(ii) If L and L' are equivalent, then SG(L) = SG(L'). □

The serialization graphs for logs L_3 and L_4 of the previous section
are given below.

[Diagrams of SG(L_3) and SG(L_4); not recoverable from the source.]

(Compare to SG(L_1) in Section 2.4.)
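
Because SG(L) for an mv log depends only on the read operations, it can be
computed without looking at the order of the log. The Python sketch below
illustrates this with the operations of L_5 (Section 3.3) and with a
hypothetical variant in which T_4 reads y_2 instead of y_3. The tuple
encoding (kind, transaction, item, version) and the helper names are
assumptions of this sketch, as is the restriction to one-letter item names.

```python
def parse_mv(text):
    """Parse 'w0[x0] r1[x0]' into [('w', 0, 'x', 0), ('r', 1, 'x', 0)].
    Assumes one-letter item names and integer transaction indices."""
    ops = []
    for tok in text.split():
        left = tok.index("[")
        ops.append((tok[0], int(tok[1:left]), tok[left + 1], int(tok[left + 2:-1])))
    return ops

def serialization_graph(ops):
    """Edges (i, j), meaning T_i -> T_j, for every read r_j[x_i] with i != j."""
    return {(version, txn) for kind, txn, _, version in ops
            if kind == "r" and version != txn}

text = ("w0[x0] w0[y0] w0[z0] r2[x0] w2[y2] r1[x0] r1[z0] w1[x1] "
        "r3[z0] w3[y3] w3[z3] r4[x1] r4[y3] r4[z3]")
sg = serialization_graph(parse_mv(text))
sg_variant = serialization_graph(parse_mv(text.replace("r4[y3]", "r4[y2]")))
print(sorted(sg))
print(sorted(sg_variant - sg))
```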

3.3  One-Copy  Serializability 

Although the database has multiple versions, users expect their
transactions to behave as if there were just one copy of each data item. Serial
logs don't always behave this way. Here is a simple example.

  w_0[x_0] w_0[y_0] r_1[x_0] w_1[y_1] r_2[y_0] w_2[x_2]

T_2 reads-y-from T_0 even though T_1 comes between T_0 and T_2 and produces
a new value for y. This behavior cannot be reproduced with only one copy of
y. In a one copy database, if T_0 comes before T_1 and T_1 is before T_2,
then T_2 must read the value of y produced by T_1.

We must therefore restrict the set of allowable serial logs.


A serial mv log L is one-copy serial (or 1-serial) if for all i, j,
and x, if T_j reads-x-from T_i then i = j or T_i is the last transaction preceding
T_j that writes into any version of x. (Since L is totally ordered, the
word "last" in this definition is well-defined.) The log above is not
1-serial, because T_2 reads-y-from T_0, but w_0[y_0] < w_1[y_1] < r_2[y_0]. L_5 below
is 1-serial.

L_5 = w_0[x_0] w_0[y_0] w_0[z_0] r_2[x_0] w_2[y_2] r_1[x_0] r_1[z_0] w_1[x_1]
      r_3[z_0] w_3[y_3] w_3[z_3] r_4[x_1] r_4[y_3] r_4[z_3]
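
The 1-serial condition can be checked in one pass over a serial mv log. The
following Python sketch is an illustration only: it assumes its input is
already serial (it does not verify that), encodes operations as
(kind, transaction, item, version), and uses hypothetical helper names.

```python
def parse_mv(text):
    """Parse 'w0[x0] r1[x0]' into [('w', 0, 'x', 0), ('r', 1, 'x', 0)].
    Assumes one-letter item names and integer transaction indices."""
    ops = []
    for tok in text.split():
        left = tok.index("[")
        ops.append((tok[0], int(tok[1:left]), tok[left + 1], int(tok[left + 2:-1])))
    return ops

def is_one_serial(serial_log):
    """Check the 1-serial condition; the input is assumed to be serial."""
    writers = {}  # item -> transactions that wrote a version of it, in log order
    for kind, txn, item, version in serial_log:
        if kind == "w":
            writers.setdefault(item, []).append(txn)
        elif version != txn:
            # Writers of this item that belong to earlier transactions.
            earlier = [t for t in writers.get(item, []) if t != txn]
            if not earlier or earlier[-1] != version:
                return False
    return True

L5 = parse_mv("w0[x0] w0[y0] w0[z0] r2[x0] w2[y2] r1[x0] r1[z0] w1[x1] "
              "r3[z0] w3[y3] w3[z3] r4[x1] r4[y3] r4[z3]")
example = parse_mv("w0[x0] w0[y0] r1[x0] w1[y1] r2[y0] w2[x2]")
print(is_one_serial(L5), is_one_serial(example))
```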


A log is one-copy serializable (or 1-SR) if it's equivalent to a
1-serial log. For example, L_3 of Section 3.1 is equivalent to L_5, as can
be verified by Fact 1; hence L_3 is 1-SR. L_4 is equivalent to no
1-serial log (this can be verified by checking all possible serial logs
with the same operations as L_4); hence L_4 is not 1-SR.

It is possible for a serial log to be 1-SR even though it is not
1-serial itself. For example

  w_0[x_0] r_1[x_0] w_1[x_1] r_2[x_0]

is not 1-serial since T_2 reads-x-from T_0 instead of T_1. But it is 1-SR,
because it is equivalent to

  w_0[x_0] r_2[x_0] r_1[x_0] w_1[x_1]

One-copy serializability is our correctness criterion for multiversion
concurrency control. The following theorem justifies this criterion,
proving that an mv log behaves like a serial non-mv log iff the mv log is
1-SR.

First, we extend our notion of log equivalence to handle mv and
non-mv logs. Let L and L' be (mv or non-mv) logs over T. L and L'
are equivalent, ≡, if they have the same reads-from's.


1-SR EQUIVALENCE THEOREM. Let L be an mv log over T. L is equivalent
to a serial, non-mv log over T iff L is 1-SR.

Proof (if). Let L_s be a 1-serial log equivalent to L. Form a
serial, non-mv log L_s' by translating each w_i[x_i] into w_i[x] and
r_j[x_i] into r_j[x]. Consider any reads-from in L_s, say T_j reads-x-from
T_i. Since L_s is 1-serial, no w_k[x_k] lies between w_i[x_i] and r_j[x_i].
Hence no w_k[x] lies between w_i[x] and r_j[x] in L_s'. Thus,
T_j reads-x-from T_i in L_s'. This establishes L_s' ≡ L_s. Since L_s ≡ L,
L ≡ L_s' follows by transitivity (since ≡ is an equivalence relation).

(only if). Let L_s' be the hypothesized serial, non-mv log equivalent
to L. Translate L_s' into a serial mv log L_s by mapping each w_i[x]
into w_i[x_i] and each r_j[x] into r_j[x_i] such that T_j reads-x-from T_i
in L_s'. This translation preserves reads-from's, so L_s ≡ L_s'. By transitivity,
L ≡ L_s.

It remains to prove that L_s is 1-serial. Consider any reads-from
in L_s', say T_j reads-x-from T_i. Since L_s' is a non-mv log, no w_k[x]
lies between w_i[x] and r_j[x]. Hence no w_k[x_k] lies between w_i[x_i]
and r_j[x_i] in L_s. Thus, L_s is 1-serial, as desired.


3.4  The  1-Serializability  Theorem 

To tell if an mv log is 1-SR we use a modified serialization graph.
Given a log L and data item x, a version order for x is any (non-reflexive)
total order over all of x's versions written in L. A version order, <<, for
L is the union of the version orders for all data items. A possible version
order for L_3 of Section 3.1 (or L_5 of Section 3.3) is

        x_0 << x_1
  << =  y_0 << y_2 << y_3
        z_0 << z_3


Given L and a version order <<, the multiversion serialization
graph, MVSG(L,<<), is SG(L) with the following edges added:

(*) for each r_k[x_j] and w_i[x_i] in L, k ≠ i, if x_i << x_j then
    include T_i → T_j, else include T_k → T_i.

(Compare  to  SGd^)  in  Section  2-5.) 

The following theorem is our principal tool for analyzing multiversion
concurrency control algorithms.


1-SERIALIZABILITY THEOREM. An mv log L is 1-SR iff there exists a
version order << such that MVSG(L,<<) is acyclic.
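
The theorem turns the 1-SR test into a search for a version order. The
Python sketch below builds MVSG(L,<<) for a given version order, tests it for
cycles, and then tries every version order. It is a hedged illustration, not
the paper's procedure: the encoding (kind, transaction, item, version), the
function names, and the exhaustive search are assumptions. When applying
rule (*), the sketch also skips the case i = j, on the reading that w_i[x_i]
is a write other than the one that produced x_j (as in the proofs that use
the rule); that reading is an assumption too.

```python
from itertools import permutations, product

def parse_mv(text):
    """Parse 'w0[x0] r1[x0]' into [('w', 0, 'x', 0), ('r', 1, 'x', 0)]."""
    ops = []
    for tok in text.split():
        left = tok.index("[")
        ops.append((tok[0], int(tok[1:left]), tok[left + 1], int(tok[left + 2:-1])))
    return ops

def mvsg_edges(ops, version_order):
    """version_order: item -> sequence of writer indices, earliest version first."""
    reads = [op for op in ops if op[0] == "r"]
    writes = [op for op in ops if op[0] == "w"]
    edges = {(j, k) for _, k, _, j in reads if j != k}  # SG(L): T_j -> T_k
    for _, k, x, j in reads:
        rank = {t: n for n, t in enumerate(version_order[x])}
        for _, i, y, _ in writes:
            if y != x or i == k or i == j:
                continue
            # Rule (*): x_i << x_j gives T_i -> T_j, otherwise T_k -> T_i.
            edges.add((i, j) if rank[i] < rank[j] else (k, i))
    return edges

def is_acyclic(edges):
    nodes = {n for edge in edges for n in edge}
    while nodes:
        # Nodes with no incoming edge from a remaining node can be removed.
        sources = {n for n in nodes
                   if not any(a in nodes and b == n for a, b in edges)}
        if not sources:
            return False
        nodes -= sources
    return True

def is_one_sr(ops):
    """Try every version order; exponential, for small examples only."""
    writers = {}
    for kind, txn, item, _ in ops:
        if kind == "w":
            writers.setdefault(item, []).append(txn)
    items = sorted(writers)
    for choice in product(*(permutations(writers[x]) for x in items)):
        if is_acyclic(mvsg_edges(ops, dict(zip(items, choice)))):
            return True
    return False

text = ("w0[x0] w0[y0] w0[z0] r2[x0] w2[y2] r1[x0] r1[z0] w1[x1] "
        "r3[z0] w3[y3] w3[z3] r4[x1] r4[y3] r4[z3]")
order = {"x": [0, 1], "y": [0, 2, 3], "z": [0, 3]}
ops = parse_mv(text)
variant = parse_mv(text.replace("r4[y3]", "r4[y2]"))  # T_4 reads y_2 instead
print(is_acyclic(mvsg_edges(ops, order)), is_one_sr(variant))
```

The first log has the operations of L_5, and the version order is the one
given above; the variant differs only in T_4's read of y.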


Proof (if). Let L_s be a serial mv log induced by a topological
sort of MVSG(L,<<). I.e., L_s is formed by topologically sorting
MVSG(L,<<), and as each node T_i is listed in the sort, T_i's
operations in L are added to L_s one by one in any order consistent with L.
L_s has the same operations as L, so by Fact 1, L_s ≡ L.

It remains to prove that L_s is 1-serial. Consider any reads-from
situation, say T_k reads-x-from T_j. Let w_i[x_i] be any other write on a
version of x. If x_i << x_j, then by rule (*) of the MVSG definition, the
graph includes T_i → T_j. This edge forces T_j to follow T_i in L_s. If
x_j << x_i, then by rule (*), MVSG(L,<<) includes T_k → T_i. This forces T_k
to precede T_i in L_s. In both cases, T_i is prevented from falling between
T_j and T_k. Since T_i was an arbitrary writer on x, this proves that no
transaction that writes a version of x comes between T_j and T_k in
L_s. Thus L_s is 1-serial.

(only if) Given L and <<, let MV(L,<<) be the graph specified by
statement (*) of the MVSG definition. Statement (*) depends only on the
operations in L and <<; it does not depend on the order of operations
in L. Thus, if L_1 and L_2 are multiversion logs with the same operations,
then MV(L_1,<<) = MV(L_2,<<), for all version orders <<.

Let L_s be a 1-serial log equivalent to L. All edges in SG(L_s) go
"left-to-right", i.e. if T_i → T_j then T_i is before T_j in L_s. Define
<< by: x_i << x_j only if T_i is before T_j in L_s. All edges in MV(L_s,<<)
are also left-to-right. Therefore all edges in MVSG(L_s,<<) = MV(L_s,<<) ∪ SG(L_s)
are left-to-right, too. This implies that MVSG(L_s,<<) is acyclic.

By Fact 1, L and L_s have the same operations. Hence,
MV(L,<<) = MV(L_s,<<). By Fact 2, SG(L) = SG(L_s). Therefore MVSG(L,<<) =
MVSG(L_s,<<). Since MVSG(L_s,<<) is acyclic, so is MVSG(L,<<). □

Sections 4-6 use the 1-Serializability Theorem to analyze multiversion
concurrency control algorithms. We conclude this section with a complexity
result.


3.5  1-Serializability  is  NP-Complete 


1-SR COMPLEXITY THEOREM. It is NP-complete to decide whether an mv log
is 1-SR.


Proof (membership in NP). Let L be an mv log over T. Guess a 1-serial
log L_s over T and verify L_s ≡ L. By Fact 1, we can verify L_s ≡ L by
comparing the logs' operation sets.

(NP-hardness). The reduction is from the log SR problem. Let L' be
a non-mv log over T. Map L' into an equivalent mv log L by translating
each w_i[x] into w_i[x_i] and each r_j[x] into r_j[x_i] such that T_j
reads-x-from T_i in L'. By the 1-SR Equivalence Theorem, L is 1-SR iff
there exists a non-mv serial log L_s' such that L ≡ L_s'. But, by
transitivity, L_s' exists iff L' is SR. Thus L' is SR iff L is 1-SR. □

Papadimitriou and Kanellakis prove that a related problem is
NP-complete [PK]: Given a conventional log L, can one transform L into a 1-SR
mv log by mapping each w_i[x] into w_i[x_i] and each r_j[x] into r_j[x_i] for
some x_i where w_i[x] < r_j[x]? This problem corresponds to choosing versions
for  reading  after  having  scheduled  the  operations.  Our  problem  corresponds 
to  choosing  versions  at  the  same  time  as  scheduling  the  operations. 


4. MULTIVERSION TIMESTAMPING


The  earliest  multiversion  concurrency  control  algorithm  that  we  know 
of  is  Reed's  multiversion  timestamping  algorithm  [Reed]. 

Each transaction, T_i, is assigned a unique timestamp, TS(i), when it
begins executing. Intuitively, the timestamp tells the "time" at which the
transaction began. Formally, timestamps are just numbers with the property
that each transaction is assigned a different timestamp. Each Read and
Write carries the timestamp of the transaction that issued it, and each version
carries the timestamp of the transaction that wrote it.

Operations are processed first-come-first-served. But the translation
from data item operations to version operations makes it appear as if
operations were processed in timestamp order.

The  algorithm  works  as  follows. 

• r_i[x] is translated into r_i[x_k], where x_k is the version of x
  with largest timestamp ≤ TS(i).

• w_i[x] has two cases. If the dbs has already processed r_j[x_k] such
  that TS(k) < TS(i) < TS(j), then w_i[x] is rejected. Otherwise w_i[x]
  is translated into w_i[x_i]. Intuitively, w_i[x] is rejected if it
  would invalidate r_j[x_k].
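
The two translation rules can be sketched as a small scheduler. The Python
class below is a minimal illustration under stated assumptions: versions are
identified by the timestamp of their writer, no values are stored, and the
class and method names are hypothetical. It is not Reed's implementation
[Reed], only a rendering of the two rules above.

```python
class MVTimestampScheduler:
    def __init__(self):
        self.versions = {}  # item -> set of timestamps of existing versions
        self.reads = {}     # item -> list of (reader timestamp, version read)

    def read(self, ts, item):
        """Translate r_i[x] into r_i[x_k]; return the timestamp k of the
        version chosen, i.e. the largest version timestamp <= ts."""
        candidates = [v for v in self.versions.get(item, ()) if v <= ts]
        if not candidates:
            raise LookupError("no version of %r is visible at %r" % (item, ts))
        chosen = max(candidates)
        self.reads.setdefault(item, []).append((ts, chosen))
        return chosen

    def write(self, ts, item):
        """Return False if w_i[x] is rejected, True if version x_i is created."""
        for reader_ts, version_ts in self.reads.get(item, []):
            if version_ts < ts < reader_ts:
                return False  # the write would invalidate an earlier read
        self.versions.setdefault(item, set()).add(ts)
        return True

dbs = MVTimestampScheduler()
dbs.write(0, "x")            # T_0 creates the initial version x_0
first = dbs.read(2, "x")     # the transaction with timestamp 2 reads x_0
late = dbs.write(1, "x")     # rejected: 0 < 1 < 2 matches the read above
newer = dbs.write(3, "x")    # accepted: creates x_3
second = dbs.read(2.5, "x")  # still reads x_0, although x_3 now exists
print(first, late, newer, second)
```

The last read shows the point of keeping old versions: a reader with a
smaller timestamp is served the older version instead of being rejected.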

J  k 

We wish to use serializability theory to prove this algorithm correct.
To do so, we must state the algorithm in serializability theoretic terms.
We take the description of the algorithm above and infer properties that all
logs produced by the algorithm will satisfy. These properties form our
formal definition of the algorithm. We use serializability theory to prove
that these log properties imply 1-serializability.

The following properties form our formal definition of the mv
timestamping algorithm. Let L be an mv log over {T_0, ..., T_n}.


TS1. Every T_i has a numeric timestamp TS(i) satisfying a uniqueness
condition: TS(i) = TS(j) iff i = j.

TS2. Every r_k[x_j] and w_i[x_i] are < related; i.e., r_k[x_j] < w_i[x_i]
or vice versa.

TS3.1 For every r_k[x_j], TS(j) ≤ TS(k).

TS3.2 For every r_k[x_j] and w_i[x_i], i ≠ j, if w_i[x_i] < r_k[x_j], then
either TS(i) < TS(j) or TS(k) ≤ TS(i).

TS4. For every r_k[x_j] and w_i[x_i], i ≠ j, if r_k[x_j] < w_i[x_i], then
either TS(i) < TS(j) or TS(k) ≤ TS(i).

Property TS1 just says that transactions have unique timestamps. TS2
is implicit in the description of how the algorithm works; without this
property, the statement, "If the dbs has already processed r_j[x_k]..." is
not well-defined. TS3 states that at the time r_k[x_j] is processed, x_j
is the version of x with largest timestamp ≤ TS(k). TS4 states that once
the dbs has processed r_k[x_j], it will not process any w_i[x_i] with
TS(j) < TS(i) < TS(k).


Properties TS3.2 and TS4 can be simplified. By TS2, r_k[x_j] and
w_i[x_i] are < related. So, TS3.2 and TS4 are equivalent to

TS5. For every r_k[x_j] and w_i[x_i], i ≠ j, either TS(i) < TS(j) or
TS(k) ≤ TS(i).

We  now  prove  that  any  log  satisfying  these  properties  is  1-SR.  In 
other  words,  mv  timestamping  is  a  correct  concurrency  control  algorithm. 


MULTIVERSION TIMESTAMPING THEOREM. All logs produced by the mv
timestamping algorithm are 1-SR.


Proof. Let L be a log produced by the algorithm. Define a version
order by: x_i << x_j implies TS(i) < TS(j). We prove that all edges in
MVSG(L,<<) are in timestamp order: if T_i → T_j is an edge, then TS(i) < TS(j).

Let T_i → T_j be an edge of SG(L). This edge corresponds to a reads-from
situation, i.e., for some x, T_j reads-x-from T_i. By TS3.1, TS(i) ≤ TS(j);
by TS1, TS(i) ≠ TS(j). So TS(i) < TS(j), as desired.

Consider any edge introduced by rule (*) of the MVSG definition. Let
w_i[x_i], w_j[x_j], and r_k[x_j] be the operations stipulated by rule (*). There
are two cases. (1) x_i << x_j. Then the edge is T_i → T_j. TS(i) < TS(j) comes
from our definition of <<. (2) x_j << x_i. Then the edge is T_k → T_i. By TS5,
either TS(i) < TS(j) or TS(k) ≤ TS(i). The first option is impossible,
since the definition of << requires TS(j) < TS(i). By TS1, TS(k) ≠ TS(i).
So, TS(k) < TS(i), as desired.

This proves that all edges in MVSG(L,<<) are in timestamp order. Since
timestamps are numbers, hence totally ordered, it follows that MVSG(L,<<)
is acyclic. So by the 1-Serializability Theorem, L is 1-SR. □


5. MULTIVERSION LOCKING


Bayer et al. [BEHR, BHR] and Stearns and Rosenkrantz [SR] have presented
multiversion algorithms that synchronize using a technique similar to locking.
This section studies a generalization of their algorithms. As in the
previous section, we start with an informal description of the algorithm.
Then we state log properties the algorithm induces. Finally we prove that
these log properties imply 1-serializability.

Each transaction and version exists in one of two states: certified
or uncertified. When a transaction begins, it is uncertified; when a version
is written, it, too, is uncertified. Later actions of the algorithm cause
the transaction and all versions it wrote to become certified. The concept of
certified corresponds to closed in [SR].

Let c_i[x_i] be the event "x_i is certified." The algorithm requires
that all c_i[x_i] and r_k[x_j] be < related. Also all c_i[x_i] and c_j[x_j]
must be < related. A version order is defined by: x_i << x_j iff
c_i[x_i] < c_j[x_j].

The  algorithm  works  as  follows. 

• r_i[x] is translated into r_i[x_k] where x_k is either the last
  (w.r.t. <<) certified version of x, or any uncertified version. The algorithm
  may use any rule whatever for deciding which of these versions to read.

• w_i[x] is translated into w_i[x_i]. As stated above, x_i is uncertified
  at this point.

• When a transaction finishes executing, the dbs attempts to certify it
  and all versions it wrote. For each data item x that T_i wrote, the dbs
  tries to set a certify-lock on x for T_i. This succeeds iff no other
  transaction already has a certify-lock on x; if the lock can't be set, T_i
  waits until it can. When T_i has all of its certify-locks, two further
  conditions must be satisfied:

  C1. For each x_k that T_i read, x_k is certified.

  C2. For each x_i that T_i wrote, and each x_k that is already certified,
      all transactions that read x_k have been certified.

Attaining C1 is just a matter of time; once C1 is satisfied no future
event can cause it to become false. To attain C2, we set a certify-token
on x to stop future reads from reading certified versions of x; instead,
they may read x_i or any other uncertified version of x.

When these conditions hold, T_i is declared to be certified. This fact
is broadcast to all versions T_i wrote. When a version x_i receives this
information, it, too, is certified; i.e., the event c_i[x_i] occurs. When
x_i is certified, the certify-lock and certify-token on x are released.

This algorithm, like most locking algorithms, can deadlock. Deadlocks
can arise from two independent causes: (1) waiting for certify-locks; and
(2) waiting for conditions C1 and C2. To detect deadlocks, the algorithm can
use a directed blocking graph whose nodes are the transactions, and whose edges
are all T_i → T_j such that T_i is blocking the progress of T_j. There is a
deadlock iff the graph has a cycle [Holt, KC]. Deadlock prevention schemes such
as those in [BG, RS2] can also be used. The system should keep track of the two
types of deadlock separately. To resolve deadlocks caused by certify-locks,
the system should force one or more transactions to give up enough of their
certify-locks to break the deadlock; these transactions can try later to get
these locks back. To break deadlocks caused by C1 and C2, the system must
abort one or more transactions. (Cascading abort is possible if the algorithm
allows transactions to read uncertified versions.)
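
Deadlock detection on a blocking graph amounts to finding a directed cycle.
The Python sketch below uses a depth-first search for this purpose; it is an
illustration only, and the graph encoding (a dictionary from each
transaction to the transactions its edges point to) and the name find_cycle
are assumptions of this sketch. The recursion is adequate for small graphs.

```python
def find_cycle(graph):
    """Return a list of transactions forming a cycle, or None if there is none."""
    state = {}  # node -> "active" while on the DFS path, "done" afterwards
    path = []

    def visit(node):
        state[node] = "active"
        path.append(node)
        for nxt in graph.get(node, ()):
            if state.get(nxt) == "active":
                return path[path.index(nxt):]  # back edge closes a cycle
            if nxt not in state:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        state[node] = "done"
        path.pop()
        return None

    for node in list(graph):
        if node not in state:
            cycle = visit(node)
            if cycle:
                return cycle
    return None

# Hypothetical blocking graphs.
deadlocked = {"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}}
waiting_only = {"T1": {"T2"}, "T2": set()}
print(find_cycle(deadlocked), find_cycle(waiting_only))
```

Keeping one such graph per cause of blocking would let a system apply the two
different resolution strategies described above.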

The algorithm induces the following log properties. These properties
form our formal definition of the mv locking algorithm. Let L be an mv
log over {T_0, ..., T_n}. And let us augment L with symbols that represent
important events in the algorithm, specifically: for each T_i, let c_i
represent the event "T_i is declared to be certified"; for each version x_i
written by T_i, let cl_i[x_i] represent, "The dbs sets a certify-lock on x
for T_i"; and for each x_i, let c_i[x_i] represent, "x_i is certified."

L1.1 For every T_i, c_i follows all of T_i's reads and writes.

L1.2 For every x_i written by T_i, cl_i[x_i] < c_i < c_i[x_i].

Property L1 says that a transaction is certified after it executes; all
certify-locks must be obtained before the transaction is certified; and
the transaction must be certified before its versions are certified.

L2.1  Every  cl.lx.]  and  cl.lx.J  are  <  related. 

-  ii  33 

L2. 2  For  every  x.  and  x.,  if  cl,[x.]<cl.[x.)  then  c .  [x .  ]  <  cl  [x 

-  l  3  1133  1133 

L2  says  that  certify  locks  conflict — two  transactions  cannot  simultaneously 

hold  certify-locks  on  the  same  data  item. 

L3.1  Every  r,  [x.]  and  c. [x. ]  are  <  related. 

-  k  3  11 

L3 . 2  For  every  r,  [x  ]  and  w . (x. ] ,  i  ¥  j ,  if  c . [x. ]  <  r,  [x  ]  and 
-  k  3  11  J  11  k  3 

c  .  I  x  .  ]  <  r  [  x  .  ]  ,  then  c.[x.]<c.[x  ]. 

3  3  k  3  1133 

L3  expresses  the  rule  for  translating  reads.  If  x_.  is  already  certified 

at  the  time  r,  [x.]  occurs,  then  x.  is  the  last  certified  version  at  that 
*  3  3 


L4. 1  For  every  r,  [x  .  ]  ,  k  f  j  ,  c  .  [x  J  <  c,_. 

k  3  33k 

L4.2  For  every  r,_  lx^]  and  w_.  lx..),  i  j*  j ,  if  r,  [x^  ]  <  c .  [x,  ]  and 

““ ““  K  ^  1  K  J  11 

c  .  lx  .  ]  <  c .  ,  then  c,  <  c ,  . 

331  k  1 

These  last  properties  are  certification  conditions  Cl  and  C2,  respectively. 
The  following  lemmas  extract  useful  properties  from  L1-L4. 


LEMMA 1.  Let $T_i$ and $T_j$ be transactions that write x. Then

either  $cl_i[x_i] < c_i < c_i[x_i] < cl_j[x_j] < c_j < c_j[x_j]$,

or      $cl_j[x_j] < c_j < c_j[x_j] < cl_i[x_i] < c_i < c_i[x_i]$.

Proof.  L2.1 requires that $cl_i[x_i]$ and $cl_j[x_j]$ be <-related.
Suppose $cl_i[x_i] < cl_j[x_j]$. By L1.2, $cl_i[x_i] < c_i < c_i[x_i]$; by
L2.2, $c_i[x_i] < cl_j[x_j]$; by L1.2 again, $cl_j[x_j] < c_j < c_j[x_j]$.
This establishes the first possibility permitted by the Lemma. If
$cl_j[x_j] < cl_i[x_i]$, the same argument establishes the second
possibility.


LEMMA 2.  Properties L1-L4 imply

L5.  For every $r_k[x_j]$, $k \ne j$, $c_j < c_k$.

L6.  For every $r_k[x_j]$ and $w_i[x_i]$, $i \ne j$, either $c_i < c_j$ or
     $c_k < c_i$.

Proof.  (L5). By L1, $c_j < c_j[x_j]$. By L4.1, $c_j[x_j] < c_k$. So
$c_j < c_k$ by transitivity.

(L6). Using logical manipulation we can express L3.2 as

L3.2'   $(c_i[x_i] < r_k[x_j]) \Rightarrow
         \big((c_i[x_i] < r_k[x_j]) \wedge \neg(c_j[x_j] < r_k[x_j])\big)
         \vee (c_i[x_i] < c_j[x_j])$.

By L3.1, the first term on the right hand side simplifies to

        $(c_i[x_i] < r_k[x_j]) \wedge (r_k[x_j] < c_j[x_j])$.

By transitivity, this implies $(c_i[x_i] < c_j[x_j])$, and so the entire
right hand side implies $c_i[x_i] < c_j[x_j]$. By Lemma 1, this implies
$c_i < c_j$. So L3.2' implies

L3.2''  $(c_i[x_i] < r_k[x_j]) \Rightarrow c_i < c_j$.

Similarly, we can express L4.2 as

L4.2'   $(r_k[x_j] < c_i[x_i]) \Rightarrow
         \neg(c_j[x_j] < c_i) \vee (c_k < c_i)$.

By Lemma 1, $c_j[x_j]$ and $c_i$ are <-related. So the first term on the
right hand side simplifies to $(c_i < c_j[x_j])$. By Lemma 1, again, this is
equivalent to $c_i < c_j$. So L4.2' is equivalent to

L4.2''  $(r_k[x_j] < c_i[x_i]) \Rightarrow c_i < c_j \vee c_k < c_i$.

L3.1 requires that $r_k[x_j]$ and $c_i[x_i]$ be <-related. This lets us drop
the left hand sides of L3.2'' and L4.2'', combining them into

        For every $r_k[x_j]$ and $c_i[x_i]$, $c_i < c_j \vee c_k < c_i$.

Since $c_i[x_i]$ exists iff $w_i[x_i]$ exists, L6 follows.  □

We now prove that any log satisfying these properties is 1-SR. In other
words, mv locking is a correct concurrency control algorithm.

MULTIVERSION LOCKING THEOREM.  All logs produced by the mv locking
algorithm are 1-SR.

Proof.  Let L be a log produced by the algorithm. Define a version order
by: $x_i \ll x_j$ implies $c_i < c_j$. We prove that all edges in
MVSG(L, <<) are in certification order: if $T_i \to T_j$ is an edge, then
$c_i < c_j$.

Let $T_i \to T_j$ be an edge of SG(L). This edge corresponds to a reads-from
situation, i.e., for some x, $T_j$ reads-x-from $T_i$. By L5, $c_i < c_j$.

Consider any edge introduced by rule (*) of the MVSG definition. Let
$w_i[x_i]$, $w_j[x_j]$, and $r_k[x_j]$ be the operations stipulated by rule
(*). There are two cases. (1) $x_i \ll x_j$. Then the edge is $T_i \to T_j$.
$c_i < c_j$ comes from our definition of <<. (2) $x_j \ll x_i$. Then the
edge is $T_k \to T_i$. By L6, either $c_i < c_j$ or $c_k < c_i$. The first
option is impossible, since the definition of << requires $c_j < c_i$. So,
$c_k < c_i$ as desired.

This proves that all edges in MVSG(L, <<) are in certification order.
Since the certification order is embedded in a partial order (namely L), it
follows that MVSG(L, <<) is acyclic. So, by the 1-Serializability Theorem,
L is 1-SR.  □
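
The edge argument of the proof can be replayed on a toy log. The sketch
below is illustrative only: the names `mvsg_edges` and
`in_certification_order`, and the encoding of a log by certification
positions, are assumptions made here. The edge rules follow the two cases
described in the proof (reads-from edges, then rule (*)).

```python
def mvsg_edges(writes, reads, cert_pos):
    """Build the edge set of MVSG(L, <<) for a small log.

    writes   : dict item -> set of transactions that wrote a version of it
    reads    : list of (k, x, j) triples, each standing for r_k[x_j]
    cert_pos : dict transaction -> position of c_i in the log; the version
               order is taken as x_i << x_j whenever c_i < c_j
    """
    edges = set()
    for k, x, j in reads:
        if j != k:
            edges.add((j, k))                # reads-from edge T_j -> T_k
        for i in writes[x]:
            if i in (j, k):
                continue                     # rule (*) needs a third party
            if cert_pos[i] < cert_pos[j]:    # case (1): x_i << x_j
                edges.add((i, j))
            else:                            # case (2): x_j << x_i
                edges.add((k, i))
    return edges

def in_certification_order(edges, cert_pos):
    """True iff every edge T_a -> T_b satisfies c_a < c_b."""
    return all(cert_pos[a] < cert_pos[b] for a, b in edges)
```

With transactions 0 and 1 both writing x and certified in that order, a
reader 2 certified last that reads $x_1$ yields only edges in certification
order, whereas a reader certified last that reads $x_0$ yields the edge
$T_2 \to T_1$ against the order. The second log violates L6, so the
algorithm could not have produced it.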

The Stearns and Rosenkrantz algorithm [SR] differs from ours in two
respects. They allow at most one uncertified version of a data item to exist
at any point in time, by requiring that Write operations set write-locks.
Consequently, their algorithm never needs more than two versions of any data
item: one certified version and at most one uncertified version. This fits
nicely with database recovery [Gray]. Stearns and Rosenkrantz identify the
certified version of a data item with its "before-value", and the uncertified
version with its "after-value." The other difference involves deadlock
handling. Their algorithm uses an interesting new deadlock avoidance scheme
based on timestamps.

The Bayer et al. algorithm [BEHR, BHR] also uses at most two versions
of each data item. As in [SR], the versions of a data item are identified
with its before- and after-values. Unlike [SR], they use the blocking graph
to help translate data item reads into version reads. They prove that they
can always select a correct version to read. That is, reads never cause a
log to become non-1-SR and never cause deadlocks. This is a good property
since it allows read-only transactions (queries) to run with little
synchronization delay and no danger of deadlock.

6.  MULTIVERSION MIXED METHOD

Prime Computer, Inc. has developed an interesting multiversion algorithm
[Dubo]. Prime's algorithm, like those at the end of Section 5, integrates
concurrency control with database recovery. Unlike those algorithms, Prime's
algorithm can exploit multiple certified versions of data items. Computer
Corporation of America has adopted Prime's algorithm for its Adaplex dbs
[CFLNR]. This section studies a generalization of Prime's algorithm.

The algorithm we study is called a mixed method. A mixed method is a
concurrency control algorithm that combines locking with timestamping [EG].
Mixed methods introduce a new problem: consistent timestamp generation. A
timestamping algorithm uses timestamps to order conflicting transactions;
intuitively, if $T_i$ and $T_j$ conflict, then $T_i$ is synchronized before
$T_j$ iff $TS(i) < TS(j)$. A locking algorithm orders transactions
on-the-fly; intuitively, if $T_i$ and $T_j$ conflict, then $T_i$ is
synchronized before $T_j$ iff $c_i < c_j$. To combine locking and
timestamping, we must render their synchronization orders consistent.

Our algorithm uses mv timestamping to process read-only transactions
(queries). The algorithm uses mv locking to process general transactions
(updaters). Queries and updaters are assigned timestamps satisfying two
properties:

1.  Let $T_i$ and $T_j$ be updaters. If $c_i < c_j$ then $TS(i) < TS(j)$.

2.  Let $T_q$ be a query and $T_i$ an updater. If $r_q[x_k] < w_i[x_i]$
    then $TS(q) < TS(i)$.

A consistent timestamp generator is any means of assigning timestamps that
satisfy these properties.

Our algorithm uses a Lamport clock to generate consistent timestamps.
Recall the discussion of distributed systems from Section 2. A
Lamport clock assigns a number to each event (called its time) subject to
two conditions.

LC1.  If e and f are events of the same process and e happened
      before f, then time(e) < time(f).

LC2.  If e is the event "Process P sends message M" and f is
      the event "Process Q receives M", then time(e) < time(f).

LC1 is easily achieved using clocks or counters local to each process.
LC2 can be implemented by stamping each message with the local clock time
when it was sent; if a process Q receives a message whose time t is
greater than Q's local time, Q pushes its clock ahead to t.
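
Conditions LC1 and LC2 can be sketched with a per-process counter. This is a
minimal illustration, not the paper's implementation: the class and method
names are hypothetical, and the choice to tick once more after pushing the
clock ahead is an assumption made here so that the receive event is stamped
strictly later than the send event, as LC2 requires.

```python
class LamportClock:
    """Per-process logical clock; one instance per process."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Stamp a local event (LC1): strictly later than the previous one."""
        self.time += 1
        return self.time

    def send(self):
        """Stamp a send event; the returned value travels with the message."""
        return self.tick()

    def receive(self, msg_time):
        """Stamp a receive event (LC2): push the clock ahead, then tick."""
        self.time = max(self.time, msg_time)
        return self.tick()
```

For example, if process P has stamped five local events and then sends a
message, a fresh process Q that receives it stamps the receipt later than
P's send time, even though Q's own counter was still at zero.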

LC1 and LC2 imply

LC.  Let e and f be events in a distributed system. If e < f
     then time(e) < time(f) [Lamp].

LC is precisely the condition we need to generate consistent timestamps.
When an updater $T_i$ is certified, the process that certifies it assigns
$TS(i) = time(c_i)$. By LC, $c_i < c_j$ implies $time(c_i) < time(c_j)$;
hence $TS(i) < TS(j)$ as desired. When a query $T_q$ begins executing, we
assign TS(q) = the current Lamport time. So for all reads $r_q[x_k]$,
$TS(q) < time(r_q[x_k])$. Consider any write $w_i[x_i]$ such that
$r_q[x_k] < w_i[x_i]$. By locking property L1 (see Section 5),
$w_i[x_i] < c_i$, so by transitivity $r_q[x_k] < c_i$. By LC this implies
$time(r_q[x_k]) < time(c_i)$; hence $TS(q) < TS(i)$ as desired.

We now describe the algorithm in detail.

•  The system maintains a Lamport clock.

•  Updaters use the mv locking algorithm of Section 5.

•  When an updater $T_i$ is certified, the system assigns
$TS(i) = time(c_i)$. This timestamp is transmitted to all versions that
$T_i$ wrote. Thus, certified versions have timestamps, but uncertified
versions don't.

•  When a query $T_q$ begins executing, the system assigns TS(q) = the
current time.

•  Consider any read by $T_q$, $r_q[x]$. As in Section 4, we want to
translate this into $r_q[x_k]$ where $x_k$ is the version of x with largest
timestamp < TS(q). But, some care is needed since uncertified versions don't
have timestamps. Let t be a lower bound on the possible timestamps of any
uncertified x versions. For instance, let
$t = \min\{time(cl_i[x_i]) \mid x_i \text{ is uncertified}\}$. Since
$cl_i[x_i] < c_i$, $time(cl_i[x_i])$ is a lower bound on
$time(c_i) = TS(i)$; therefore t is a lower bound on the timestamps of any
uncertified $x_i$.

•  Consider $r_q[x]$ again. If x has no uncertified versions, or if
TS(q) < t, then $r_q[x]$ reads the version $x_k$ of x with largest timestamp
< TS(q); else $r_q[x]$ waits until the condition is satisfied. (This will
eventually happen.)
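
The read rule for queries can be sketched as follows. The function name
`translate_read`, the `WAIT` sentinel, and the list-based representation of
versions are assumptions made for illustration; the sketch also assumes that
every item has an initial certified version older than any query.

```python
WAIT = "WAIT"   # hypothetical sentinel: the query must retry later

def translate_read(ts_q, certified, uncertified_lock_times):
    """Choose the version of x that a query with timestamp ts_q reads.

    certified              : list of (timestamp, value) pairs, one per
                             certified version of x
    uncertified_lock_times : list of time(cl_i[x_i]), one per uncertified
                             version of x
    """
    if uncertified_lock_times:
        # t bounds from below the timestamp any uncertified version can get.
        t = min(uncertified_lock_times)
        if not ts_q < t:
            return WAIT     # an uncertified version could still precede ts_q
    older = [(ts, v) for ts, v in certified if ts < ts_q]
    return max(older, key=lambda pair: pair[0])
```

A query with TS(q) = 7 reads the version stamped 5 whether or not a
certify-lock was set at time 8, but a query with TS(q) = 12 must wait while
a certify-lock set at time 10 is outstanding.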

The log properties induced by the algorithm are a simple combination of
the properties induced by mv timestamping and locking. The correctness proof
is similar to those in Sections 4 and 5.

MULTIVERSION MIXED METHOD THEOREM.  All logs produced by the mv mixed
method are 1-SR.  □

Prime's algorithm differs from ours in two respects. Most importantly,
Prime's algorithm doesn't use explicit timestamps. All certify events are
<-related; i.e. $c_1, \ldots, c_n$ are totally ordered. The algorithm
maintains a list, CL, of all transactions that have been certified; when
$T_i$ is certified, its identifier, i, is included in CL. When a query $T_q$
begins executing, it makes a copy of CL, denoted CL(q). When $T_q$ issues a
read, $r_q[x]$, it reads $x_k$ where $x_k$ is the latest version (w.r.t. <<)
of x such that $k \in CL(q)$. We can analyze this behavior as a special case
of our mixed method. Imagine that each updater is assigned a timestamp equal
to its place in the certification total order; i.e. TS(i) = t iff $T_i$ is
the t-th transaction to be certified. Imagine that $T_q$ is assigned the
timestamp $TS(q) = |CL(q)| + \epsilon$, for $0 < \epsilon < 1$. This is a
consistent way of assigning timestamps. If we now run $T_q$ under our
algorithm, it reads the same versions as under Prime's algorithm. Since our
algorithm is 1-SR, so is Prime's.
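
The claimed equivalence between reading through a copy of CL and reading by
timestamp can be checked on a small example. Both function names and the
list encodings below are hypothetical, introduced only for this sketch; the
version order is assumed to coincide with the certification order.

```python
def read_with_cl(versions, cl_snapshot):
    """Prime-style read: latest version of x whose writer is in CL(q).

    versions    : writer ids of the versions of x, in version order (<<)
    cl_snapshot : certified transaction ids copied when the query began
    """
    visible = [w for w in versions if w in cl_snapshot]
    return visible[-1]

def read_with_timestamps(versions, cert_order, ts_q):
    """Mixed-method read: version with the largest timestamp below ts_q,
    where TS(i) is the 1-based place of T_i in the certification order."""
    ts = {w: place for place, w in enumerate(cert_order, start=1)}
    older = [w for w in versions if w in ts and ts[w] < ts_q]
    return max(older, key=ts.get)
```

Taking $\epsilon = 0.5$, a query that copied the first n entries of the
certification order gets TS(q) = n + 0.5 and selects the same version under
both rules.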


The other difference is that Prime uses a restricted form of multiversion
locking for updaters, namely two phase locking [EGLT]. Write operations set
write-locks, so no data item ever has more than one uncertified version. And,
once $T_i$ writes x, no updater $T_j$ reads x until $T_i$ is certified, and
vice versa. Consequently, every updater can be certified as soon as it
finishes executing.

The net effect is that queries and updaters are totally decoupled. Queries
never delay or cause the abort of updaters, and updaters never delay or cause
the abort of queries.

Prime's algorithm is most naturally implemented in a centralized dbs
because of the need to totally order certify events.

The following variant is more suitable for a distributed dbs.

•  The system maintains a Lamport clock.

•  Updaters use two phase locking, hence can be certified as soon as each
finishes executing. The system assigns $TS(i) = time(c_i)$, as in the
general algorithm.

•  Queries are processed using timestamps, exactly as in the general
algorithm.

This algorithm decouples queries and updaters almost as fully as Prime's
algorithm. Queries never delay or abort updaters, and updaters never abort
queries. But an updater can delay a query under one condition: if a query
$T_q$ reads x, updater $T_i$ has a certify-lock on x, and TS(q) > the time
of that certify-lock, then $T_q$ must wait until $T_i$ certifies x.

CONCLUSION

This paper has studied the concurrency control problem for multiversion
databases. Multiversion databases add a new aspect to concurrency control.
Transactions issue operations that specify data items (e.g., read(x),
write(x)); the system must translate these into operations that specify
versions. In a single version database, concurrency control correctness
depends on the order in which reads and writes are processed. In a
multiversion database, correctness depends on translation as well as order.

We have extended concurrency control theory to account for the translation
aspect of multiversion databases. The main idea is one-copy
serializability: an execution of transactions in a multiversion database
is one-copy serializable (1-SR) if it is equivalent to a serial execution of
the same transactions in a single version database. A multiversion
concurrency control algorithm is correct if all of its executions are 1-SR.
We derived effective necessary and sufficient conditions for an execution to
be 1-SR; these conditions use the concept of version order. We gave a graph
structure, multiversion serialization graphs (MVSG's), that helps check
these conditions. Once a version order is fixed, an execution is 1-SR iff
its MVSG is acyclic. MVSG's are analogous to the serialization graphs widely
used in single version concurrency control theory.

We applied the theory to three multiversion concurrency control algorithms.
One algorithm uses timestamps, one uses locking, and one combines locking
with timestamps. The timestamping algorithm is due to Reed [Reed]. The
locking algorithm was inspired by (and generalizes) the work of Bayer et al.
[BEHR, BHR] and Stearns and Rosenkrantz [SR]. The combination algorithm
generalizes an algorithm developed by Prime Computer, Inc. [Dubo] and used
by Computer Corporation of America [CFLNR].


Abstract

Although concurrency control in both centralized and distributed systems
is a widely studied problem, not many formal quantitative methods are
known for analyzing and comparing the performance of the proposed
algorithms. These algorithms are usually complex, are not described in
standard terminology, and each makes different assumptions regarding the
underlying environment. In this paper we analyze dynamic 2PL, proceeding
from a simple deadlock prevention 2PL (which gives worst case bounds on
the performance of any deadlock preventing 2PL) to the general case of
two phase locking where deadlocks are allowed. The 2PL methods considered
here are dynamic.


1.0  INTRODUCTION

Although concurrency control in both centralized and distributed systems is a
widely studied problem, not many formal quantitative methods are known for
analyzing and comparing the performance of the proposed algorithms. These
algorithms are usually complex, are not described in standard terminology,
and each makes different assumptions regarding the underlying environment
(inputs, communication rules, etc.). Fortunately, it is possible to decompose
every concurrency control problem into two major subproblems, namely
read-write and write-write synchronization [B-G, 80]. Every algorithm must
include a subalgorithm for solving each of these problems. Current research
indicates that only a few subalgorithms are possible. In fact, most of them
are variations of two basic techniques: two phase locking (2PL) and timestamp
ordering (T/O). Therefore, it is important to develop quantitative methods
for analyzing the performance of these two techniques.

In this paper we analyze dynamic 2PL, proceeding from a simple deadlock
prevention 2PL (which gives worst case bounds on the performance of any
deadlock preventing 2PL) to the general case of two phase locking where
deadlocks are allowed. The 2PL methods considered here are dynamic. Locks are
requested in a distributed manner over the lifetime of a transaction, as
found in most 2PL methods. The analysis of simple 2PL is used primarily to
serve as a worst case performance bound for any 2PL method. In the analysis
of the general 2PL, results indicate that the rate of deadlocks at steady
state is approximately proportional to the mean number of transactions in the
system. Our techniques for analyzing 2PL apply to both centralized and
distributed systems by suitably adjusting certain assumptions and parameters.


2.0  ESSENTIAL  CONCEPTS  AND  DEFINITIONS 

A  database  consists  of  a  collection  of  logical  data  items  x,  y,  etc.  A 

logical  database  state  is  an  assignment  of  values  to  logical  data  items.  Users 
interact  with  the  database  by  executing  transactions.  Each  transaction  is  con¬ 
sidered  to  be  a  sequence  of  Read  and/or  Write  operations.  The  readset  (set  of 
logical  data  items  that  the  transaction  reads)  and  the  writeset  (set  of  logical 
data  items  that  the  transaction  writes)  are  part  of  the  information  content  of  the
transaction.  In  case  of  distributed  systems,  the  "stored"  readset  and  writeset 
refer  to  the  stored  copies  of  the  logical  data  items  in  the  various  places.  Two 
transactions  conflict  if  the  (stored)  readset  or  writeset  of  one  intersects  the 
(stored)  writeset  of  the  other  (read-write  and  write-write  conflicts) .  Two  trans¬ 
actions  need  synchronization  only  if  they  conflict  [B-G,  80]. 

Synchronization in update processing is necessary to maintain the consistency
of databases. Different synchronization techniques can be applied to
synchronize read-write and write-write conflicting transactions, as long as
they induce the same total order on the transactions. Two phase locking and
timestamp ordering are the two most common methods to solve these
subproblems.

2PL  synchronizes  read  and  write  operations  by  explicitly  detecting  and  preventing 
conflicts.  The  essentials  of  the  method  can  be  outlined  as  follows:  Before  read¬ 
ing  or  writing  on  a  data  item  x,  a  transaction  must  own  a  lock  on  x. 

There  are  two  rules  for  ownership  of  locks: 


(1)  Different  transactions  cannot  simultaneously  own  locks  that  conflict  (they 
conflict  if  they  are  both  on  the  same  item) . 
(2)  Once a transaction surrenders ownership of a lock, it may not obtain
additional locks.

The second rule forces transactions to obtain locks in a two-phase manner (a
growing and a shrinking phase in terms of the set of locks owned by the
transaction). 2PL is a correct synchronization technique in the sense that
logs¹ that are two phase locked are serializable [BSW, 79].
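
The two ownership rules can be sketched with a toy lock table. This is an
illustration under simplifying assumptions, not a definition from the paper:
the class name `TwoPhaseTransaction` is hypothetical, locks are exclusive
(any two locks on the same item conflict, as in rule 1), and a denied
request simply returns False instead of blocking or restarting.

```python
class TwoPhaseTransaction:
    """A transaction that obeys the two lock-ownership rules of 2PL."""

    def __init__(self, name, lock_table):
        self.name = name
        self.lock_table = lock_table    # shared dict: item -> owner name
        self.shrinking = False          # set once any lock is surrendered

    def lock(self, item):
        if self.shrinking:
            # Rule 2: after the first release, no further locks.
            raise RuntimeError("rule 2: no new locks after a release")
        owner = self.lock_table.get(item)
        if owner not in (None, self.name):
            return False                # Rule 1: another owner holds it
        self.lock_table[item] = self.name
        return True

    def unlock(self, item):
        if self.lock_table.get(item) == self.name:
            del self.lock_table[item]
            self.shrinking = True       # the shrinking phase begins
```

A scheduler built on this sketch would decide what happens on a denied
request (wait or restart), which is exactly where the deadlock handling
schemes discussed next come in.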

2PL  may  block  a  transaction  by  causing  it  to  wait  for  an  unavailable  lock.  In 
case  of  distributed  systems,  a  transaction  may  have  to  wait  in  a  site,  even  if  the 
site  is  currently  idle.  If  this  blocking  is  uncontrolled,  deadlocks  may  occur. 

A waits-for graph is a directed graph indicating which transactions are
waiting for which other transactions. Its nodes represent transactions and
its edges the waits-for relationship. (An edge is drawn from transaction
$T_i$ to $T_j$ ($i \ne j$) iff $T_i$ is waiting for a lock currently owned
by $T_j$.) A deadlock in the database system corresponds to a cycle in this
graph. If a deadlock exists, one (or more) transactions in the cycle are
restarted. Avoidance of cyclic restarts is possible if, for example, the
"youngest" transaction in the cycle is always restarted.


There  are  two  mechanisms  to  resolve  deadlocks,  namely  deadlock  prevention  and 
detection. 

1.  The  execution  of  transactions  is  modelled  by  a  set  of  logs,  each  of  which 
indicates  the  order  in  which  reads  and  writes  are  processed  at  one  data 
module. 


3.0  DEADLOCK PREVENTION AND DETECTION TECHNIQUES IN 2PL ALGORITHMS


Deadlock prevention is a "cautious" scheme, since it restarts a transaction
every time the system suspects that a deadlock will occur. Suppose
transaction $T_i$ is blocked by $T_j$. A procedure test($T_i$, $T_j$) is
generally applied to the blocking pair. If the pair passes the test, then
$T_i$ is permitted to wait for $T_j$. Else, one or the other is restarted.
If $T_i$ ($T_j$) is restarted, we have a non-preemptive (preemptive) deadlock
prevention method. The test($T_i$, $T_j$) must ensure that the addition of
the edge ($T_i$, $T_j$) to the waits-for graph cannot introduce a cycle.

Many tests have this property. The simplest example is when $T_i$ is always
restarted (i.e., the blocking pair never passes the test). This "simple" 2PL
causes no waiting but many restarts. In terms of restarts it may be
considered to be a worst case of the prevention techniques. The analysis of
the performance of simple 2PL is outlined in Section 5.0.

Concurrency control algorithms which induce fewer restarts are priority or
timestamp based [RLS, 78], [Th, 79] (wait-die, wound-wait 2PL).
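
The two timestamp-based tests named above can be written as functions of the
blocking pair. The sketch follows the commonly stated form of wait-die and
wound-wait; the function names, the returned strings, and the convention
that a smaller timestamp means an older transaction are assumptions made
here, not notation from the paper.

```python
def wait_die(ts_requester, ts_holder):
    """Non-preemptive test: an older requester waits, a younger one restarts."""
    if ts_requester < ts_holder:
        return "wait"
    return "restart requester"

def wound_wait(ts_requester, ts_holder):
    """Preemptive test: an older requester restarts the holder, a younger waits."""
    if ts_requester < ts_holder:
        return "restart holder"
    return "wait"
```

Under wait-die every waits-for edge runs from an older to a younger
transaction, and under wound-wait from a younger to an older one, so in
either scheme the graph cannot contain a cycle.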

In deadlock detection schemes, transactions are allowed to wait for each
other in an uncontrolled manner and are aborted only when a deadlock actually
occurs. The real-time waits-for graph is searched for cycles and (at least)
one transaction in each cycle is aborted. In Section 6.0 we analyze this
general 2PL by describing the evolution of the waits-for graph as a Markovian
process. The steady-state deadlock rate is found to be approximately
proportional to the mean number of transactions in the system, and the
steady-state conflict rate approximately proportional to the mean of the
product of the number of transactions in the system and the number of
unblocked transactions. Prior to the discussion of the models, let us begin
with a brief review of the current research in this area.


4.0  PAST  WORK 


Garcia-Molina [GM, 79] compared variants of centralized 2PL to the majority
consensus algorithm of [Th, 78], mostly based on simulation studies. He
concluded that centralized 2PL outperforms Thomas's algorithm under all
tested conditions. His assumptions included predeclaration of readsets and
writesets and a fully redundant database. The analysis was limited to
optimistic situations in which run-time conflict never occurred. Ries [R, 79]
analyzed by simulation various forms of centralized 2PL and variations of
distributed 2PL. For all variations, static locking was assumed and the
simulations were set up in a way that no deadlocks occurred and run-time
conflicts were very rare. Consequently, the main result was that all methods
had similar behavior. One of the early analytic approaches to performance
evaluation of concurrency control schemes was by Gelenbe and Sevcik
[Ge-S, 78]. Their analysis was mainly of timestamp ordering methods. However,
their assumptions about the update rules differ from those of current
implementations. They defined measures of "coherence" (how much the values of
various copies of the data agree) and "promptness" (how up-to-date the data
is). Their results were based on assumptions about transmission delays and
state that coherence improves rather quickly and promptness drops slowly
under all methods considered. Menasce and Nakanishi [M-N, 79] examined
"conflict oriented" methods analytically and locking by simulation. The
analysis was mostly done on "certification methods." (A transaction has three
phases: a read phase, a computation phase, and a test and update phase. It
can be restarted only at the end of the last phase. After the transaction has
visited all the places it wants, its timestamp is compared against timestamps
obtained during the read phase.) Unfortunately, their assumptions were not
very clear and the model analysis seemed to be oversimplified. D. Potier and
P. Leblanc [P-L, 80] analyzed the case of centralized 2PL with static locking
(a transaction is allowed to enter the system only if all its requested locks
are available). The probability that a new transaction is granted its locks,
given k other transactions already active, was found. A general limitation of
all previous analyses of 2PL is that only static locking is examined.
Although this simplifies the analysis with respect to time dependence of
state changes of the system, the issue of deadlocks and restarts is avoided
under static locking. In order to be able to quantify the degradation in
performance due to restarts or deadlocks, it is important to study dynamic
locking. In addition, since most real systems use dynamic locking, the
primary objective of our analysis is to focus on the performance of dynamic
concurrency control systems, including the discussion of the rate of
deadlocks and restarts.


5.0  ANALYSIS  OF  THE  "SIMPLE"  2PL  AND  WORST  CASE  BOUNDS 


5.1  The  Model 

The physical database is assumed to be composed of M granules (data items)
with no redundancy. In the case of a centralized multiprogramming system the
granules are located in some storage device (e.g., disk). In the case of a
distributed database, there is one granule per site and the M sites together
with their communication links form a fully connected graph. Links and
processors are assumed completely reliable.

The transactions are characterized by "hopping behavior." That is, each
transaction moves from granule to granule and requests the lock on arrival.
If the lock is granted then the transaction "hops" to another place, after
spending some time in that granule (site). If the lock is denied then the
transaction simply restarts; hence, this is the worst case of 2PL in terms of
number of restarts.

Each transaction selects its move to the next granule at random
(independently of other transactions) with probability $(1-\theta)/M$ for
each of the M granules², or it completes (exits the system) with probability
$\theta$. (Note that this includes revisiting the same granule. When this
happens, the transaction does not conflict with itself and simply spends some
time in the granule and then selects the next granule to move to by the same
process.) On each visit to a granule, the transaction obtains service for a
random time interval following the exponential distribution, with service
rate $\mu'$.

Note that, because of our assumptions, the total lifetime of a transaction
(conditioned on no restarts) again follows the exponential distribution, of
mean

        $1/\mu = 1/(\theta\mu')$

(geometric sum of exponentials of same mean).
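
The parenthetical claim can be checked with a short transform calculation.
The sketch below assumes that the number of granule visits N is geometric on
{1, 2, ...} with $P(N = n) = \theta(1-\theta)^{n-1}$ and that visit times
are independent exponentials of rate $\mu'$; the symbols N, T (total
lifetime), and s (transform variable) are introduced here and do not appear
in the original.

```latex
\begin{align*}
E\left[e^{-sT}\right]
  &= \sum_{n=1}^{\infty} \theta (1-\theta)^{n-1}
     \left(\frac{\mu'}{\mu'+s}\right)^{n} \\
  &= \frac{\theta\,\mu'/(\mu'+s)}{1 - (1-\theta)\,\mu'/(\mu'+s)}
   = \frac{\theta\mu'}{s + \theta\mu'} .
\end{align*}
```

The last expression is the transform of an exponential distribution with
rate $\theta\mu'$, which gives the stated mean $1/(\theta\mu')$.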

In case of a restart the transaction releases all its locks instantaneously
and follows the same "hopping" behavior. Similarly, on total completion the
transaction releases its locks instantaneously. The input process (arrival
time distribution of transactions) is assumed to be Poisson of rate
$\lambda$.


2.  [L-N, 81] conducted simulations in a non-uniform access model in which
20% of the database is accessed 80% of the time; they found that the behavior
is very similar to that of the random access model under heavier load. This
indicates that the access pattern is not the dominant factor in the overall
performance of large databases.


5.2  Justification  of  the  Assumptions 


The  Poisson  arrival  process  has  been  widely  used  to  model  open  transaction 
streams.  In  a  distributed  database,  this  assumption  is  further  justified  because 
the  input  process  is  the  sum  of  many  point  processes  (one  for  each  site)  and  this 
sum  converges  to  a  Poisson  process  under  weak  conditions.  Elimination  of  redun¬ 
dancy  is  analogous  to  the  primary  copy  policies  of  locking  in  redundant  systems. 

The "hopping" transaction model implies that at any time instant t a transaction can be active at only one granule, and that it requests locks sequentially. This is true in most centralized databases. Because no commercial distributed DBMS is available, this assumption has yet to be tested for distributed systems.

Previously, the "hopping" transaction model has been assumed in [R-S-L, 78] for distributed database systems. It is a necessary assumption for the conflict-driven restart methods, including wait-die and wound-wait, to work in distributed systems (see [R-S-L, 78] and [B-G, 78]). The authors justify the use of the "hopping" model by considering systems in which programs are executed at single sites but can call other programs as subroutines at other sites. The execution of a program together with all its associated subroutine calls is treated as the execution of a single process. At any time, the "active" site is the site of the subroutine being executed, and the inactive sites are those where programs are waiting for subroutines to return or where subroutines have finished their execution. When a program at one site calls a program at another site, both programs together are responsible for maintaining global consistency.

There are other reasons for the "hopping" assumption, one of which is the difficulty of implementing a different transaction behavior in distributed systems. The users would have to write "parallel" programs, producing "transaction incarnations" active at different sites at the same time. This would require sophisticated synchronization methods for "chasing" the incarnations in case of restarts, deadlock avoidance, etc. (For such systems there is a potential danger that a process will be rolled back at one site and made permanent at another site, thereby violating consistency requirements.)

Furthermore, the users would need a good knowledge of the information contained in the sites and of the communication delay parameters of the system. Consequently, no such systems are currently implemented.

Having  described  the  overall  model,  we  shall  now  proceed  with  the  analysis  study. 


5.3  Performance  Analysis  of  the  "Simple"  2PL


In the performance study we shall consider the following: (1) nontrivial lower bounds for the steady state probability that a transaction can complete without any restarts; (2) an upper bound on the average number of restarts per transaction; and (3) the mean response time of a transaction (restarts included).


5.3.1  Analysis  of  the  System


Assume that the system has a steady state and let $P_E$ be the steady state probability that a transaction will complete without any restarts. Consider at time $t$ a "marked" transaction $T_1$ obtaining service at a granule; $T_1$ is said to be active at $t$. Let $\gamma_k(t,dt)$ be Prob{$T_1$ will conflict at $(t,t+dt)$ given that $T_1$ is active at $t$ and there are $k$ transactions present in the system}. Then Prob{$T_1$ will conflict at $(t,t+dt)$} is simply the unconditional probability of $\gamma_k(t,dt)$, which we denote by $\gamma(t,dt)$. Assume that $\gamma(t,dt)$ has the steady state value $\gamma\,dt$. From that assumption,

$$\gamma\,dt = \lim_{t\to\infty} \sum_{k=1}^{\infty} \gamma_k(t,dt)\,p(k,t)$$

where $p(k,t)$ = Prob{$k$ transactions in the system at time $t$}. Consequently, the average rate of conflicts at steady state is a constant (namely $\gamma$), and hence the probability $f(T)$ of no conflicts (and hence no restarts) in the lifetime $T$ of a transaction is, by basic properties of the Poisson process,

$$f(T) = e^{-\gamma T}$$

In addition,

$$\mathrm{Prob}\{T_1 \text{ does not restart during its whole life}\} = \int_{T=0}^{\infty} f(T)\,\mathrm{Prob}\{T \le \text{lifetime} < T+dT\} = \int_{T=0}^{\infty} e^{-\gamma T}\,\mu\,e^{-\mu T}\,dT$$

So, we conclude that

$$P_E = \frac{\mu}{\mu + \gamma} \qquad (2)$$


To find an expression for the restart rate at steady state, we begin by pointing out that in order for $T_1$ to conflict at $(t,t+dt)$ (and consequently restart) it is necessary for the following events to occur: (1) $T_1$ completes service at its current granule within $(t,t+dt)$ without departing from the system, and (2) the next granule $T_1$ chooses is one of the granules already locked by one of the other $k-1$ transactions in the system. Let $T_2, T_3, \ldots, T_k$, $k \ge 2$, be the other transactions and let $L_2, \ldots, L_k$ be the numbers of distinct granules locked at $t$ by $T_2, \ldots, T_k$ respectively. Then

$$\gamma_k(t,dt) = (\mu'\,dt)\,(1-\theta)\,\frac{L_2 + \cdots + L_k}{M}$$


Let us approximate the sum $L_2+\cdots+L_k$ by the sum of the mean numbers of granules visited by each transaction (given no restarts). Since this mean is bounded above by $1/\theta$, we shall approximate $L_2+\cdots+L_k$ by $(k-1)/\theta$. This approximation is pessimistic, since $1/\theta$ is an upper bound achieved only in systems with an infinite number of granules and does not refer to distinct granules. Another reason for this approximation to be pessimistic arises from the fact that $(k-1)/\theta$ increases with $k$ for every value of $k$, while $L_2+\cdots+L_k$ is bounded by $M-1$. Fortunately, it is desirable to be pessimistic here, since the results for the "simple" 2PL will be worst case performance bounds for any 2PL method. Using this approximation, we can obtain the following:

$$\lim_{t\to\infty}\sum_{k=1}^{\infty}\gamma_k(t,dt)\,p(k,t) = \lim_{t\to\infty}\sum_{k=1}^{\infty}\alpha\,(k-1)\,dt\;p(k,t), \qquad \text{where } \alpha = \frac{\mu'(1-\theta)}{\theta M} = \frac{\mu(1-\theta)}{\theta^2 M}$$

$$\Rightarrow\quad \gamma\,dt = \alpha\,dt\left[\sum_{k=0}^{\infty}k\,p(k) - \sum_{k=0}^{\infty}p(k) + p(k=0)\right] = \alpha\,\bigl[E(k) - 1 + p(k=0)\bigr]\,dt$$

where $p(k)$, $p(k=0)$, $E(k)$ are the steady state values of $p(k,t)$, $p(0,t)$ and the mean number of transactions in the system at time $t$.


Substituting $\gamma$ into Eq. (2) we get, finally,

$$P_E = \frac{1}{1 + \dfrac{1-\theta}{\theta^2 M}\,\bigl[E(k) + p(k=0) - 1\bigr]} \qquad (3)$$


Note that Eq. (3) is derived without any assumption about the distribution of transactions in the system. From Eq. (3), since $p(k=0) \le 1$, we get

$$P_E \ge \frac{1}{1 + \dfrac{1-\theta}{\theta^2 M}\,E(k)} \qquad (4)$$


Since $E(k)$ is an operational variable, it could be obtained by measurement in most practical systems. So Eq. (4) can be used as a means to determine a lower bound on $P_E$ by measuring $E(k)$. Furthermore, in most real systems the value of $E(k)$ is bounded by a constant (the maximum multiprogramming degree, due to finite memory considerations and thrashing).
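As an illustration of how a measured $E(k)$ feeds Eq. (4), here is a minimal sketch. The function names are hypothetical; the formulas are those of Eq. (3) and Eq. (4), under the same pessimistic approximation of the number of locked granules.

```python
def p_no_restart(theta, M, mean_in_system, p_empty):
    """Eq. (3): probability of completing without restarts.

    mean_in_system is E(k); p_empty is the steady state p(k=0).
    """
    a = (1.0 - theta) / (theta ** 2 * M)
    return 1.0 / (1.0 + a * (mean_in_system + p_empty - 1.0))


def pk_lower_bound(theta, M, mean_in_system):
    """Eq. (4): lower bound that needs only E(k), e.g. from measurement."""
    if not 0.0 < theta <= 1.0 or M < 1 or mean_in_system < 0.0:
        raise ValueError("need 0 < theta <= 1, M >= 1, E(k) >= 0")
    a = (1.0 - theta) / (theta ** 2 * M)
    return 1.0 / (1.0 + a * mean_in_system)
```

With theta = 0.1, M = 1000 and a measured E(k) = 10, the coefficient is 0.9 / (0.01 * 1000) = 0.09, so the bound is 1 / 1.9, roughly 0.53.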

Let $C$ be such a bound on $E(k)$. Then,

$$P_E \ge \frac{1}{1 + \dfrac{1-\theta}{\theta^2 M}\,C} \qquad (5)$$


Assuming that the upper bound of $E(k)$ exists and is finite, we can note the following from Eq. (3):

If there exists a constant $m > 1$ such that $\theta \ge m\,(\sqrt{M})^{-1}$, then as $M \to \infty$, $P_E$ approaches the constant value

$$\frac{1}{1 + \dfrac{1}{m^2}\,E(k)}$$

In other words, if the fraction of the database that each transaction accesses is less than $1/(m\sqrt{M})$, then the probability of passing without any restart is nonzero even at high traffic (given our stated assumptions).


In the above analysis, one should note that for the case $1/\theta = O(M)$, using $(k-1)/\theta$ as an approximation of $L_2+\cdots+L_k$ can be very pessimistic. For example, when $1/\theta = M$, it gives $L_2+\cdots+L_k \approx M(k-1)$, whereas $L_2+\cdots+L_k \le M-1$. Thus, in this case it seems more appropriate to approximate $L_2+\cdots+L_k$ by its worst case value, namely $M-1$; then $\gamma_k(t,dt)$ becomes $\mu'\,(1-\theta)\,(1-1/M)\,dt$ (not depending on $k$). Hence

$$P_E = \frac{\mu}{\mu+\gamma} \ge \frac{\mu'\theta}{\mu'\theta + \mu'(1-\theta)(M-1)/M}$$

$$P_E \ge \frac{M\theta}{M\theta + (1-\theta)(M-1)} \qquad (7)$$


From Eq. (7), we can note the following:

(1) $P_E > 0$ for all $\theta > 0$.

(2) The lower bound of Eq. (7) tends to $\theta$ as $M \to \infty$.

(3) If $\theta = 1/M$, then $P_E \ge M/(M^2 - M + 1) \approx 1/(M-1)$.
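The three observations can be checked numerically. The sketch below assumes only the closed form of Eq. (7); the function name is hypothetical.

```python
def pth_lower_bound(theta, M):
    """Eq. (7): lower bound on P_E using the worst case M - 1 locked granules."""
    if not 0.0 < theta <= 1.0 or M < 1:
        raise ValueError("need 0 < theta <= 1 and M >= 1")
    return M * theta / (M * theta + (1.0 - theta) * (M - 1))


if __name__ == "__main__":
    # The bound approaches theta as M grows, and for theta = 1/M it equals
    # M / (M*M - M + 1).
    for M in (10, 100, 10_000):
        print(M, pth_lower_bound(0.3, M), pth_lower_bound(1.0 / M, M))
```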


Another useful bound for $P_E$ can be derived from Eq. (3) by using an estimate of $E(k)$ instead of its direct value. By Little's formula,

$$E(k) = \lambda\,\bar{T}$$

where $\bar{T}$ is the mean time a transaction spends in the system, and

$$\bar{T} = \sum_{i=1}^{\infty} \bar{T}_i\; p(i-1 \text{ restarts})$$

where $\bar{T}_i$ is, by definition, the mean time in the system given $i-1$ restarts.

Since $p(0 \text{ restarts}) = P_E$ and

$$\bar{T}_1 \le \frac{1}{\theta\mu'} + u = \frac{1}{\mu} + u$$

where $u$ is the time to restart a transaction and $1/\mu$ is the mean time of a transaction in the database given no restarts,

$$\Rightarrow\quad \bar{T}_i \le i\,(1/\mu + u) \quad \forall i$$

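Under the assumptions in the analysis, the number of restarts per transaction is geometrically distributed, and so

$$p(i-1 \text{ restarts}) = (1-P_E)^{i-1}\,P_E$$

Hence,

$$\bar{T} \le \sum_{i=1}^{\infty} i\,(1/\mu + u)\,(1-P_E)^{i-1}\,P_E \le \frac{1}{P_E}\,(1/\mu + u)$$

$$E(k) \le \frac{\lambda}{P_E}\,(1/\mu + u)$$

and from Eq. (4)

$$P_E \ge \frac{1}{1 + \dfrac{1-\theta}{\theta^2 M}\cdot\dfrac{\lambda}{P_E}\,(1/\mu + u)} \qquad (9)$$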


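Solving Eq. (9) with respect to $P_E$ we get

$$P_E \ge 1 - \frac{1-\theta}{\theta^2 M}\,\lambda\,(1/\mu + u)$$

If there exists a constant $m \ge 1$ such that $\theta \ge m\,(\sqrt{M})^{-1}$, then we get

$$P_E \ge 1 - \frac{1}{m^2}\,\lambda\,(1/\mu + u)$$

and we see that $P_E$ has nonzero high values even in cases of $\lambda/\mu \gg 1$.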
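The linear bound obtained by solving for $P_E$ is easy to evaluate. A minimal sketch follows; the function name is hypothetical, and clamping at zero is an added convenience (a probability cannot be negative), not part of the original derivation.

```python
def pl_lower_bound(theta, M, lam, mu, u=0.0):
    """Linear lower bound on P_E: 1 - ((1-theta)/(theta^2 M)) * lam * (1/mu + u).

    lam is the Poisson arrival rate, mu the completion rate of a transaction
    given no restarts, u the time needed to restart a transaction.
    """
    a = (1.0 - theta) / (theta ** 2 * M)
    bound = 1.0 - a * lam * (1.0 / mu + u)
    return max(0.0, bound)  # below zero the bound carries no information
```

For theta = 0.1, M = 1000, lam = 2, mu = 1 and u = 0 this gives 1 - 0.09 * 2 = 0.82.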

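In systems where pages are the units of locking, e.g., System R and DMS 1100, the value of $M$ can be very large. So let us now consider the case $M \gg 1$.

Here we can assume $u$ to be negligible relative to $1/\mu$. In this case,

$$E(k) \le \frac{\lambda}{\mu P_E}$$

and from Eq. (9)

$$P_E \ge \begin{cases} 1 - \dfrac{1-\theta}{\theta^2 M}\cdot\dfrac{\lambda}{\mu} & \text{if } \dfrac{\lambda}{\mu} < \dfrac{\theta^2 M}{1-\theta} \\[2ex] \ldots & \text{otherwise} \end{cases} \qquad (10)$$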


Note that $p(k=0)$ increases as $M$ increases and attains its maximum as $M \to \infty$. In this case, the system can be approximated by an M|G|∞ system with birth rate $\lambda$ and death rate $k\mu P_E$ ($k \ge 1$), because there is no queueing in the system, owing to immediate restarts in case of conflicts. Hence the maximum value of $p(k=0)$ is given by $e^{-\lambda/(\mu P_E)}$. Applying Eq. (3),

$$P_E \ge \frac{1}{1 + \dfrac{1-\theta}{\theta^2 M}\left(\dfrac{\lambda}{\mu P_E} + e^{-\lambda/(\mu P_E)} - 1\right)} \qquad (11)$$

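If $1/\theta <$ … we get from Eq. (11)

$$P_E \ge \frac{1 - \dfrac{1-\theta}{\theta^2 M}\cdot\dfrac{\lambda}{\mu}}{1 + \dfrac{1-\theta}{\theta^2 M}\left(e^{-\lambda/(\mu P_E)} - 1\right)}$$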


Since $P_E \le 1$, we get for $\lambda/\mu < 1$

$$P_E \ge \frac{\ldots}{\ln(1-\lambda/\mu)} \qquad (12)$$


5.3.2  Overview  of  the  Bounds  Obtained  for  $P_E$


From the analysis above, the various bounds obtained for $P_E$ hold for certain ranges of values of $\lambda/\mu$, $\theta$ (or both). In cases in which more than one bound equation is applicable, one should select the best lower bound. In order to evaluate the various bounds, we proceed to obtain the exact solution of the model as a Markov process by numerical methods. A description of the method and results follows in Section 5.3.3. We found that the bounds of Eq. (10) and of Eq. (12) (the latter for $\lambda/\mu < 1$) approximate the actual value of $P_E$ very well from below for $M \ge 100$. The bound

$$\frac{M\theta}{M\theta + (1-\theta)(M-1)}$$

yields effective results for small $M$ ($M < 10$). If an accurate estimate of $E(k)$ can be produced, then the bound

$$\frac{1}{1 + \dfrac{1-\theta}{\theta^2 M}\,E(k)}$$

approximates $P_E$ very closely from below in all cases (see Table 3).

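Finally, for $M < 100$, the computational effort for an exact solution is acceptable, and it can be used to get an accurate estimate of $P_E$ if necessary. Let us now give a brief summary in the following table, where PL denotes the bound of Eq. (10), PLOG the bound of Eq. (12), PTH the bound of Eq. (7), and PK the bound of Eq. (4).

[Table 1: recommended lower bound by range of $M$ and of $\lambda/\mu$. Surviving entries: "best of PL, PTH (PLOG if applicable)"; "best of PL, PTH"; for other high values of $\lambda/\mu$, "best of PTH, PK" and "best of PK, …"; for $10 < M < 100$, "max(PK, PTH) or exact solution".]

2 For $\lambda/\mu \gg \max(1, \theta^2 M/(1-\theta))$ the derivation of tighter analytical lower bounds on $P_E$ is under current investigation.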


Evidence from the simulations conducted at CCA [ ] indicates that performance is considerably degraded for high values of $\lambda/\mu$. Consequently, it could be the case that further analysis for such cases may not be necessary for the operational range of existing systems.


We developed by numerical methods the exact solution of the global balance equations for a system consisting of $M$ granules and a CPU (feeding the database) with maximum MPL equal to $N$. We assume that restarted and newly arrived transactions join the queue of the CPU. The system is modelled as a Markov process with state vector $(j,k,\ell)$, where $j \le N$ is the number of transactions in the CPU queue, $k \le M$ is the number of active transactions, and $\ell$ is the total number of locks held ($k \le \ell \le M$). We assume that when a transaction completes or restarts, it frees a number of locks which is a random variable uniform in the interval $[1, \ell-k+1]$. The other parameters of the system include the CPU rate ($\mu_{CPU}$), the input rate $\lambda$ and the granule service rate $\mu'$; $\theta$ is the exit probability, as in the simple 2PL model. The solution of the global balance equations is produced by the power method of finding the eigenvector corresponding to the biggest eigenvalue of a stochastic matrix. This method is preferred because it has minimal storage requirements.
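The power method mentioned above can be sketched as follows. This is a generic illustration on a small row-stochastic matrix, not the solver of [S-78]; it assumes the continuous-time model has already been converted to a discrete-time stochastic matrix `P` (for instance by uniformization), and the function name is hypothetical. Only the current probability vector is kept between iterations, which is the source of the low storage cost.

```python
def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Left eigenvector of row-stochastic P for eigenvalue 1, by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # One step of pi <- pi * P (row vector times matrix).
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        total = sum(nxt)
        nxt = [x / total for x in nxt]  # renormalise against rounding drift
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")
```

For the two-state chain `[[0.9, 0.1], [0.5, 0.5]]` the iteration converges to the stationary vector (5/6, 1/6).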


For more details on the method, see [S-78]. Our results show that the lower bounds PLOG and PL work well for $M \ge 100$. PK works well for any value of $M$, if $E(k)$ is known. For smaller $M$, either PTH or PK is suggested. For very small $M$ ($< 50$) the computational effort is small and the exact solution can be used as an alternative. All our bounds were validated to be true pessimistic bounds (see Table 3).


5.3.4  Bound  on  the  Average  Number  of  Restarts  Per  Transaction


Let $n$ denote the average number of restarts per transaction. Then

$$n = \sum_{i=1}^{\infty} (i-1)\,(1-P_E)^{i-1}\,P_E = \frac{1}{P_E} - 1$$

From Eq. (3),

$$n = \frac{1-\theta}{\theta^2 M}\,\bigl[E(k) + p(k=0) - 1\bigr]$$

From Eq. (4),

$$n \le \frac{1-\theta}{\theta^2 M}\,E(k)$$

In general, if PBEST is the best lower bound on $P_E$ (using Table 1) for each case, then

$$n \le \frac{1}{\mathrm{PBEST}} - 1$$

A general conclusion is that if there exists a constant $m > 1$ such that $\theta \ge m\,(\sqrt{M})^{-1}$, then $n$ is small for large $M$. This conclusion applies well to cases where $M > 100$.

5.3.5  Average  Response  Time  of  a  Transaction

The average response time $R$ of a transaction is bounded above by

$$R \le (n+1)\,(1/\mu + u) = (1/P_E)\,(1/\mu + u)$$

For $M > 100$, we can use

$$R \le (1/P_E)\cdot\frac{1}{\theta\mu'}$$

where, as an estimate of $P_E$, the best value of PL, PLOG, PTH, PK can be used.
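The two quantities above follow directly from any lower bound on $P_E$. A minimal sketch, with hypothetical function names, assuming the geometric restart model stated in the text:

```python
def mean_restarts(p_e):
    """Average number of restarts per transaction: 1/P_E - 1."""
    if not 0.0 < p_e <= 1.0:
        raise ValueError("P_E must lie in (0, 1]")
    return 1.0 / p_e - 1.0


def response_time_bound(p_e, mu, u=0.0):
    """Upper bound on mean response time: (1/P_E) * (1/mu + u)."""
    return (1.0 / p_e) * (1.0 / mu + u)


def mean_restarts_series(p_e, terms=2000):
    """Same mean computed from the geometric distribution, as a cross-check."""
    return sum((i - 1) * (1.0 - p_e) ** (i - 1) * p_e for i in range(1, terms + 1))
```

Passing a lower bound on $P_E$ (for instance PBEST) to either of the first two functions yields an upper bound on the corresponding quantity, since both decrease as $P_E$ grows.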

This completes our analysis of simple 2PL, an algorithm which should be viewed as a useful theoretical tool for obtaining worst case performance bounds, in terms of restarts, for any dynamic 2PL method. In the next section, we shall proceed to the more general dynamic 2PL analysis.


6.0  THE  GENERAL  CASE  OF  TWO-PHASE  LOCKING  AND  THE  PROBABILITY  OF  DEADLOCK  IN  THE
STEADY  STATE

From the "simple" 2PL analysis, we now turn to the more general case of 2PL. We shall begin our discussion of the general case with an overview of the mechanism and the related concepts. Primarily, we are interested in an analytic model to estimate the steady-state rates of conflicts and deadlocks in a database system with a general 2PL policy. We assume that each transaction T sequentially requests locks on a randomly selected set of data items. All locks are assumed to be exclusive locks; i.e., T is granted a lock on data item X only if no other transaction currently owns a lock on X. If the lock request is denied, T is placed on a FIFO queue for X. All locks held by a transaction are released simultaneously when the transaction completes. The input process of transactions is assumed Poisson of rate $\lambda$. Let us consider the waits-for graph at time t.

The nodes in this graph represent all active transactions and the edges indicate which transactions are waiting to obtain locks. In particular, there is an edge from T to T' iff

(1) T' is immediately in front of T on the queue for some data item X, or

(2) T is the first transaction on the queue for X and T' currently owns a lock on X.

Note that there can be at most one edge emanating from any transaction T because of our assumption that transactions obtain locks sequentially. T is said to be blocked if there is an edge emanating from it; else T is free. New (or restarted) transactions enter the system as free transactions. Whenever a new edge is added to the graph, it must emanate from a previously free transaction (since free transactions are the only ones that can request locks). We shall assume that the probability of adding more than one edge to the graph in the time interval (t, t + dt) is negligible. In our analysis we assume that deadlock is handled by a deadlock detection algorithm: as soon as a deadlock is formed, an "oracle" detects it and breaks the deadlock cycle instantaneously.
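The edge rules (1) and (2) and the role of the "oracle" can be made concrete with a small sketch. The data layout (a dictionary of lock owners and a dictionary of FIFO queues) and the function names are assumptions made for illustration only. Because each transaction has at most one outgoing edge, cycle detection reduces to following a single chain.

```python
def waits_for_edges(owners, queues):
    """Build the waits-for graph as a map {waiting transaction: blocking transaction}.

    owners: {item: transaction holding the exclusive lock}
    queues: {item: list of waiting transactions, front of the queue first}
    """
    edges = {}
    for item, queue in queues.items():
        ahead = owners[item]      # rule (2): the first waiter waits for the owner
        for t in queue:
            edges[t] = ahead      # rule (1): later waiters wait for the one in front
            ahead = t
    return edges


def find_cycle(edges, start):
    """Follow the unique outgoing edges from start; return the cycle found, or None."""
    seen = []
    node = start
    while node in edges and node not in seen:
        seen.append(node)
        node = edges[node]
    if node in seen:
        return seen[seen.index(node):]
    return None  # the chain ended at a free transaction
```

For example, if T1 owns x and waits for y while T2 owns y and waits for x, `find_cycle` started at T1 returns the two-transaction cycle that the oracle would break.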

Furthermore, all processing requires finite time. Based on the above assumptions, let us next consider some of the properties of the time-varying waits-for graph.

Two properties follow immediately:

Suppose transaction T is blocked by transaction T' at time t. Then (1) if T and T' are both blocked at t and the edge (T,T') does not belong to a cycle, then (T,T') will still be in the waits-for graph at (t + dt) (since processing requires finite time). (2) If T' is free at t, then either (T,T') will remain in the graph at (t, t + dt) (if T' does not complete at (t, t + dt)) or (T,T') will disappear during (t, t + dt) (if T' completes or restarts during that interval).


From these properties we derive the following state transition diagram (from t to t + dt) for a single transaction.


LEMMA 6.1. (1) If we exclude the moments at which cycles are formed and broken by the oracle, then the waits-for graph at time t is a forest of trees, whose roots are the free transactions and whose other nodes are the blocked transactions.

(2) A deadlock can be formed at (t, t + dt) only because one root is blocked by one of its descendants.

For the proof of (1), recall that a transaction can only be blocked at one data item by one transaction. To prove (2), recall our assumption that the probability of adding more than one edge to the graph in (t, t + dt) is negligible.

Next let us consider the trees of the waits-for graph as "nodes" in an evolutionary random graph. Each "node" has a "size" which is the number of transactions in the corresponding tree. The following operations change the graph dynamically from t to t + dt:

1. An edge is formed from the root of some tree $T_1$ to another tree $T_2$ ($T_2 \ne T_1$). Let size($T_1$) = $s_1$ and size($T_2$) = $s_2$. Then the two trees merge into a single tree of size $s_1 + s_2$.

2. A new tree of size 1 appears whenever a new transaction enters the system.

3. When a cycle is formed from the root of a tree to one of its descendants, then the oracle immediately restarts some node participating in the cycle. For purposes of this analysis we shall assume that the oracle restarts the root.

4. If one root completes, then a number of trees equal to the number of its immediate descendants will be formed.


For further analysis, we make the following assumptions: (i) The service time distribution for each free transaction at a data item is exponential with mean $1/\mu$ (rate $\mu$). (ii) After a local completion, a transaction exits the system with probability $\theta$ and requests one more lock with probability $1-\theta$, $0 \le \theta \le 1$. (Note that, according to this, the average number of locks a transaction needs is $1/\theta$.) (iii) The probability that a particular root will form an edge with a particular tree at (t, t + dt) is proportional to the size of the tree. This assumption can be justified if we accept that each transaction locks (on the average) the same number of data items and all data items are accessed uniformly.

Let us now define the states of the system:

A state of the evolutionary random graph is an n-tuple $(s_1,\ldots,s_n)$, where $n$ = number of trees in the system at $t$ and $s_i$ = size of tree $T_i$ at $t$.

Let $p_{ij}$ = Prob($T_i$ will conflict with $T_j$ at (t, t + dt) given $T_i$ locally completes and requests another lock).

According to our assumption, $p_{ij} = a\,s_j$ for some constant $a$ (depending on the current state). The total number of transactions in the system is $s = \sum_{i=1}^{n} s_i$ when the state is $(s_1,\ldots,s_n)$. Let $M$ be the total number of granules in the database. The total number of locked places will be approximated here by the average total number of locked granules, a pessimistic estimate of which is $(1/\theta)\sum_{i=1}^{n} s_i$. By using the above approximation,

Prob($T_i$ will conflict at (t, t + dt) given it locally completes at t) + Prob($T_i$ will not conflict at (t, t + dt) given it locally completes at t) is equal to 1.

Hence,

$$\sum_{j=1}^{n} p_{ij} + \frac{M - \dfrac{1}{\theta}\sum_{i=1}^{n} s_i}{M} = 1$$

implying $a = 1/(\theta M)$.


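LEMMA 6.2. $p_{ij} = (1/\theta M)\,s_j$ for all $i$, $j$. When the system is in state $(s_1,\ldots,s_n)$ at time t, the overall probability of deadlock will be given by

$$(1-\theta)\,(\mu\,dt)\sum_{i=1}^{n} p_{ii}$$

From this Lemma, we can readily derive:

$$\text{rate of deadlock} = \frac{\mu}{M}\cdot\frac{1-\theta}{\theta}\sum_{i=1}^{n} s_i$$

$$\text{rate of conflicts at } t = (1-\theta)\,\mu\sum_{i=1}^{n}\sum_{j=1}^{n} p_{ij} = \frac{\mu}{M}\cdot\frac{1-\theta}{\theta}\Bigl(\sum_{j=1}^{n} s_j\Bigr)\,n$$

which imply that for state $(s_1,s_2,\ldots,s_n)$

$$\frac{\text{rate of deadlock}}{\text{rate of conflicts}} = \frac{1}{n} \qquad \text{(Equation (6.1))}$$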
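Equation (6.1) can be verified for any particular state by computing both rates from the matrix $p_{ij}$. The sketch below assumes the "crude" form $p_{ij} = s_j/(\theta M)$ of Lemma 6.2; the function name is hypothetical.

```python
def deadlock_and_conflict_rates(sizes, theta, M, mu):
    """Rates of deadlock and of conflicts in state (s_1, ..., s_n) under Lemma 6.2."""
    n = len(sizes)
    a = 1.0 / (theta * M)
    # p[i][j]: probability that the root of tree i conflicts with tree j,
    # given that it locally completes and requests another lock.
    p = [[a * s_j for s_j in sizes] for _ in range(n)]
    deadlock = (1.0 - theta) * mu * sum(p[i][i] for i in range(n))
    conflict = (1.0 - theta) * mu * sum(sum(row) for row in p)
    return deadlock, conflict
```

In the state (3, 1, 2) there are n = 3 trees, so the ratio of the two rates is 1/3 whatever the values of theta, M and mu.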


(Note that if we want to take into account the fact that a transaction will not cause deadlock with itself, then we can use a more refined assumption:

$$p_{ij} = a\,s_j \ \text{ if } i \ne j, \qquad p_{ii} = a\,(s_i - 1) \qquad \text{(AR)}$$

Reasoning as above, the value of $a$ will now be

$$a = \frac{1}{\theta M}\cdot\frac{s}{s-1}, \qquad \text{where } s = \sum_{i=1}^{n} s_i$$

For large $s$, $a \approx 1/\theta M$ again. Approximating then $a$ by $1/\theta M$ we finally get

$$\text{rate of deadlock} = \frac{\mu}{M}\cdot\frac{1-\theta}{\theta}\,(s - n)$$

$$\text{rate of conflicts at } t = \frac{\mu}{M}\cdot\frac{1-\theta}{\theta}\,(s-1)\,n$$

which imply that for state $(s_1,s_2,\ldots,s_n)$

$$\frac{\text{rate of deadlock}}{\text{rate of conflicts}} = \frac{s - n}{(s-1)\,n} \qquad \text{(Eq. 6.1B)}$$

which is $\approx 1/n$ for large $s$.)

For reasons of clarity we use the "crude" assumption ($p_{ij} = (1/\theta M)\,s_j$ for all $i$, $j$) in the analysis that follows.


We now consider the probabilities of the possible state transitions of the system (from time t to t + dt).

(1) An arrival

$$(s_1,\ldots,s_n) \to (s_1,\ldots,s_n,1)$$

with probability $\lambda\,dt$.

(2) A local completion and a conflict (not a deadlock), for $n > 1$: for all $i$, $j$ with $i \ne j$, $1 \le i, j \le n$,

$$(s_1,\ldots,s_i,\ldots,s_j,\ldots,s_n) \to (s_1,\ldots,s_{i-1},s_{i+1},\ldots,s_{j-1},s_j',s_{j+1},\ldots,s_n)$$

where $s_j' = s_i + s_j$, with probability $(\mu\,dt)\,(1-\theta)\,\dfrac{s_j}{\theta M}$.

(3) A deadlock (the root of a tree completes locally and then conflicts with one of its own descendants): for all $j$, $1 \le j \le n$, and all $m = 0,\ldots,s_j - 1$,

$$(s_1,\ldots,s_j,\ldots,s_n) \to (s_1,\ldots,s_{j-1},(s_{j_1},\ldots,s_{j_m}),s_{j+1},\ldots,s_n,1)$$

such that $s_{j_1} + s_{j_2} + \cdots + s_{j_m} = s_j - 1$ and $s_{j_i} > 0$.

The probability of this transition is

$$(\mu\,dt)\,(1-\theta)\,\frac{s_j}{\theta M}\;\mathrm{Prob}\{m \mid \text{deadlock in } j\}\;\mathrm{Prob}\{s_{j_1},\ldots,s_{j_m} \mid m \text{ and deadlock in } j\}$$

Obviously, $m$ is the number of immediate descendants of the root of $T_j$ just before the deadlock.


LEMMA 6.3. Prob{m | deadlock in j} = Prob(the root of $T_j$ has $m$ immediate descendants at the time of deadlock) can be approximated by

$$\frac{R(m,s_j)}{N(s_j)} \qquad \text{for all } m = 1,\ldots,s_j - 1$$

where

$R(m,s_j)$ = number of labelled trees of $s_j$ nodes with a distinct specified root whose degree is $m$,

$N(s_j)$ = number of labelled trees of $s_j$ nodes with a distinct specified root ($1 \le$ degree(root) $\le s_j - 1$),

such that the following hold for $R(m,s_j)$ and $N(s_j)$:

(I) $N(k) = k^{k-2}$, $k \ge 2$

(II) $$R(m,k) = \binom{k-1}{m}\sum_{\substack{(k_1,\ldots,k_m) \\ k_1+\cdots+k_m = k-m-1}} \binom{k-m-1}{k_1}(k_1+1)^{k_1-1}\binom{k-m-1-k_1}{k_2}(k_2+1)^{k_2-1}\cdots(k_m+1)^{k_m-1}$$ with $1 \le m \le k-1$

(III) $N(1) = 1$ by definition.


Proof. Assuming uniformity in the way trees merge during conflicts, at steady state Prob{m | deadlock} will indeed be equal to $R(m,s_j)/N(s_j)$.

$N(k)$ by definition is the number of rooted labelled trees of $k \ge 2$ nodes and can be shown to be equal to $k^{k-2}$ [Ca, 89]. Note that the trees are labelled since we have distinct transactions as nodes, and that the class of trees which define the same partial order (with the ordering $i < j \Leftrightarrow i$ is a son of $j$) must be counted as only one tree.

To form $R(m,k)$, $1 \le m \le k-1$, the $m$ sons of the root are selected from the $k-1$ nodes in $\binom{k-1}{m}$ ways, and for each selection the remaining $k-1-m$ nodes are partitioned into $m$ groups (group $i$ has $k_i$ nodes, $0 \le k_i \le k-m-1$, $1 \le i \le m$). Then we count all possible ways of selecting labels for each of these $m$ groups, and for each $k_i$ and each of its labellings we count all possible (labelled) rooted trees formed with $k_i + 1$ nodes.
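The counting argument can be checked by direct enumeration of the sum in (II). The sketch below is an illustration with hypothetical function names; it relies only on the formulas (I) to (III) and confirms that the root-degree counts add up to $N(k)$.

```python
from math import comb


def N(k):
    """Labelled trees on k nodes with a specified root: k^(k-2), with N(1) = 1."""
    return 1 if k == 1 else k ** (k - 2)


def weak_compositions(total, parts):
    """Yield all tuples of `parts` non-negative integers that sum to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in weak_compositions(total - first, parts - 1):
            yield (first,) + rest


def R(m, k):
    """Labelled trees on k nodes with a specified root of degree m, by formula (II)."""
    if not 1 <= m <= k - 1:
        return 0
    rest = k - m - 1  # nodes left after choosing the root and its m sons
    total = 0
    for group_sizes in weak_compositions(rest, m):
        term, remaining = 1, rest
        for k_i in group_sizes:
            # choose labels for group i, then a tree rooted at son i on k_i + 1 nodes
            term *= comb(remaining, k_i) * N(k_i + 1)
            remaining -= k_i
        total += term
    return comb(k - 1, m) * total
```

For instance, `R(2, 4)` evaluates to 3 * (1 + 1) = 6, and summing `R(m, 4)` over m = 1, 2, 3 gives 9 + 6 + 1 = 16 = `N(4)`.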


EXAMPLE. $N(3) = 3$ ($= 3^{3-2}$). [Figure: the three labelled trees on nodes a, b, c with root a. Note that two drawings that give the same partial order count as one tree.]

$R(2,4) = 6 = 3\,(1+1)$. [Figure: the six labelled trees on nodes a, b, c, d in which the root a has two sons.]

In order to calculate the transition probability due to deadlock, it is necessary to consider Prob{$s_{j_1},\ldots,s_{j_m}$ | m and deadlock in j}. When deadlock occurs in the $j$-th tree, the $s_j - 1$ descendants will be fragmented into $m$ new trees of sizes $s_{j_i}$, $0 < s_{j_i} \le s_j - 1$. This process is analogous to grouping $s_j - 1$ items into $m$ random partitions. The probability of a certain partition is [Fe, 66]

$$\mathrm{Prob}\{s_{j_1},\ldots,s_{j_m} \mid m, s_j\} = \frac{(s_j-1)!}{s_{j_1}!\,s_{j_2}!\cdots s_{j_m}!}\left(\frac{1}{m}\right)^{s_j-1}$$

We, therefore, conclude that the transition due to deadlock,

$$(s_1,\ldots,s_j,\ldots,s_n) \to (s_1,\ldots,s_{j-1},(s_{j_1},\ldots,s_{j_m}),s_{j+1},\ldots,s_n,1)$$

has probability

$$(\mu\,dt)\,(1-\theta)\,\frac{s_j}{\theta M}\cdot\frac{R(m,s_j)}{N(s_j)}\cdot\frac{(s_j-1)!}{s_{j_1}!\cdots s_{j_m}!}\left(\frac{1}{m}\right)^{s_j-1}$$
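The partition probability used above is the multinomial probability of placing $s_j - 1$ items independently and uniformly into $m$ groups. A minimal sketch follows; the function name is hypothetical, and the argument is the tuple of group sizes $(s_{j_1},\ldots,s_{j_m})$.

```python
from math import factorial


def partition_probability(sizes):
    """Multinomial probability that s_j - 1 items fall into groups of these sizes.

    sizes: tuple (s_j1, ..., s_jm); its sum is s_j - 1 and its length is m.
    """
    m = len(sizes)
    total = sum(sizes)
    coefficient = factorial(total)
    for size in sizes:
        coefficient //= factorial(size)  # exact at each step: partial multinomials are integers
    return coefficient / m ** total
```

For example, three descendants split into groups of sizes (2, 1) have probability 3!/(2! 1!) * (1/2)^3 = 0.375.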

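(4) A departure from the system (a root totally completes and departs): for all $j$ and all $m = 0, 1, \ldots, s_j - 1$,

$$(s_1,\ldots,s_j,\ldots,s_n) \to (s_1,\ldots,s_{j-1},(s_{j_1},\ldots,s_{j_m}),s_{j+1},\ldots,s_n)$$

(where the root of $T_j$ completes). By the same reasoning as in the case of deadlock, this transition has probability

$$(\mu\,dt)\,\theta\cdot\frac{R(m,s_j)}{N(s_j)}\cdot\frac{(s_j-1)!}{s_{j_1}!\cdots s_{j_m}!}\left(\frac{1}{m}\right)^{s_j-1}$$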


Having identified all the transition probabilities, the process is now completely defined. Due to the complicated transitions arising in cases of deadlock and total completion, the process in general does not necessarily observe the one-step behavior as defined in [D-B, 78]. Consequently, it is conjectured that a closed form solution for the steady state probabilities is unlikely to be of product form as defined in [BCMP, 76]. Nevertheless, one can obtain numerical solutions for the problem, should the closed form solution fail to exist.

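From the steady state solution, we can now compute the conflict rate $r_c$ and the deadlock rate $r_d$ at steady state:

$$r_d = \frac{\mu}{M}\cdot\frac{1-\theta}{\theta}\sum_{\text{all states}}\Bigl(\sum_{j=1}^{n} s_j\Bigr)\,\mathrm{prob}(s_1,\ldots,s_n)$$

$$r_c = \frac{\mu}{M}\cdot\frac{1-\theta}{\theta}\sum_{\text{all states}} n\,\Bigl(\sum_{j=1}^{n} s_j\Bigr)\,\mathrm{prob}(s_1,\ldots,s_n)$$

(Remark: in each state of the state space, the rate of deadlock equals the rate of conflicts times $1/n$.) Note that $r_c$ and $r_d$ only require the aggregate joint probability

$$f(n,s) = \mathrm{Prob}\Bigl(n \text{ trees and } \sum_{j=1}^{n} s_j = s\Bigr)$$

where

n = number of free transactions in the system
s = number of transactions in the system.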


From the above, we conclude

LEMMA 6.4.

$$r_c = \frac{\mu}{M}\cdot\frac{1-\theta}{\theta}\,E(n\cdot s), \qquad r_d = \frac{\mu}{M}\cdot\frac{1-\theta}{\theta}\,E(s)$$

or the rate of deadlocks at steady state is proportional to the average number of transactions in the system, and the rate of conflicts is proportional to the mean value of the product $n\cdot s$, where $n$ = number of free transactions in the system and $s$ = total number of transactions in the system.
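Given the aggregate distribution $f(n,s)$, for instance from a numerical solution or from measurement, the two rates of Lemma 6.4 follow from two expectations. The sketch below is illustrative; the dictionary layout and the function name are assumptions.

```python
def steady_state_rates(f, theta, M, mu):
    """Return (r_d, r_c) of Lemma 6.4.

    f: dictionary {(n, s): probability}, where n is the number of free
       transactions (trees) and s the total number of transactions.
    """
    c = (mu / M) * (1.0 - theta) / theta
    mean_s = sum(prob * s for (n, s), prob in f.items())        # E(s)
    mean_ns = sum(prob * n * s for (n, s), prob in f.items())   # E(n*s)
    return c * mean_s, c * mean_ns
```

With f = {(1, 1): 0.5, (2, 4): 0.5}, theta = 0.5, M = 10 and mu = 1, the constant is 0.1, E(s) = 2.5 and E(n*s) = 4.5, giving r_d = 0.25 and r_c = 0.45.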

(Note that, by using the more refined assumption AR we would get

$$r_d = \frac{\mu}{M}\cdot\frac{1-\theta}{\theta}\,\bigl[E(s) - E(n)\bigr], \qquad r_c = \frac{\mu}{M}\cdot\frac{1-\theta}{\theta}\,\bigl[E(n\cdot s) - E(n)\bigr].)$$

Note that the results of this lemma contradict the assumption in [R, 79] that the multiprogramming level has no effect on the performance of 2PL systems.

6.1  SIMULATION  ANALYSIS  OF  THE  2PL  ALGORITHMS

Recently, some simulation studies were performed to analyze the performance
of the 2PL methods [L-N, 81]. The overall system model is very similar to the
model described here in section 5. The simulation makes more general assumptions,
which include Erlangian service times at the granules and constant
times for the hops. The study focused on large databases ($3000 \le M \le 12000$),
and simulations were carried out for both the "simple" and general 2PL. Their
major results agree very favorably with our results. For example:

(1) The number of transactions in the system affects the overall
performance (thereby contradicting Ries [R, 79] and agreeing
with the results of lemma 6.4).

(2) The number of transactions under deadlock increases almost
linearly with the number of locks held by the active transactions
and the multiprogramming level, validating our results of equations
6.1 and lemma 6.4 ($r_d$ proportional to $E(s)$).

(3) The rate of conflicts increases more than linearly with the
multiprogramming level (close to quadratic in heavy traffic),
validating our second result of lemma 6.4 ($r_c$ proportional to
$E(s \cdot n)$).

(4) Smaller lockable units imply a smaller probability of conflict,
as we expect from the analysis of section 5.


7.0  CONCLUSION 

We developed an analytic model for the general 2PL, which can be used to estimate
the steady state rates of conflicts and deadlocks in the system. Under reasonable
assumptions for transaction behavior, we found that the rate of deadlocks is
proportional to the average number of transactions in the system and the rate of
conflicts is proportional to the mean of the product of the number of free
transactions in the system multiplied by the total number of transactions in the system. Since
the rate of deadlocks is equal to the rate of restarts and because restarts directly affect
the number of transactions in the system, continued analysis is important
for further understanding and quantification of the system performance under this
feedback effect. The next step in our research will be a system model using
decomposition and the results obtained above.

SECTION V

PERFORMANCE OF EXCLUSIVE LOCKING
IN DATABASE SYSTEMS*

Nathan Goodman
Rajan Suri
Yong C. Tay


*To  appear  in  the  Proceedings  of  the  Second  ACM  SIGACT-SIGMOD  Symposium 
on  Principles  of  Database  Systems,  March  1983. 


1.  INTRODUCTION 


Concurrency control in database systems is the coordination of concurrent
access to shared data. Much is already known about the design of
concurrency control algorithms in both centralized and distributed systems;
the majority of these algorithms are in fact just combinations of two basic
techniques, locking and timestamp ordering [BG], and there is now
considerable interest in understanding the performance of these two techniques.

In  analyzing  any  system,  it  is  frequently  true  that  an  appropriate 
model  can  provide  better  insight  than  a  simulation  study,  and  at  less  cost. 
To date, most performance analyses have been done either experimentally
[LN1,LN2,MK,R] or with Markovian queueing-theoretic means [L,G,PL,SS].

This paper introduces a new framework for analyzing exclusive locking
that uses none of the usual assumptions of queueing theory besides Little's
Law [K]. For example, we do not assume that certain quantities are
exponentially distributed. The model presented is easy to understand and
costs little to solve computationally, yet captures the essential features
of the system it models.

The  new  framework  is  presented  in  Section  2  in  a  general  form.  Then, 
in  Section  3,  we  use  the  framework  to  consider  the  case  where  there  is  no 
blocking.  The  resulting  model  is  applied  to  analyze  three  different  systems 
in  Sections  4,  6  and  7.  Section  5  presents  simulation  results  that  validate 
the  model  for  the  case  considered  in  Section  4.  Conclusions  about  the 
systems  studied  and  the  model  itself  are  in  Section  8. 


2. THE FRAMEWORK


The database is a collection of data items. Each transaction makes a
sequence of requests; the time between the i-th and (i+1)-th request is the
same for all transactions. The 0-th request is a request to start
(immediately granted), and the i-th request ($i \ge 1$) is for either an
exclusive lock on a data item, or termination. Let prob(i-th request is
for termination) = $p_i$, with $p_0 = 0$.

When a transaction makes a lock request, it is immediately granted if
there is no conflict; otherwise, the transaction either waits or is
restarted. Let prob(conflict) = p. Requests for termination are granted
immediately, and each termination starts another transaction (so the number
of transactions is constant). A transaction releases its locks if and only
if it is restarted or terminated.

The  model  for  this  system  consists  of  two  parts.  The  first  is  the  flow 
diagram  illustrated  in  Figure  2.1. 

The  second  part  of  the  model  is  a  set  of  equations  describing  the 
behavior  of  the  system.  These  equations  are  derived  using  the  steady  state 
average  values  of  the  variables.  The  underlying  idea  in  our  approach  is  to 
characterize  the  system  in  terms  of  these  average  values,  instead  of  detailed 
dynamics  involving  instantaneous  values  of  each  variable.*  The  input  para¬ 
meters  to  the  model  are: 

N  —  the  number  of  transactions  in  the  system. 

D  —  the  number  of  data  items;  this  specifies  the  granularity. 

$p_i$ — the probability of termination with i locks, i = 1,2,... .

* It was pointed out to the authors by Ken Sevcik that a related approach has
been proposed for large multiclass queueing models [LZ].


$N_i$ is the number of transactions with i locks that are executing

$W_i$ is the number of transactions with i locks that are waiting

$a_i$ is the abort rate of transactions holding i locks

$t_i$ is the rate of termination of transactions holding i locks, $t_0 = 0$.

We refer to each $N_i$ as a stage. The time T between the i-th and
(i+1)-th requests is assumed to be a function of i, the number of
executing transactions $N_e = \sum_{j=0}^{\infty} N_j$, and the number of waiting
transactions $N_w = \sum_{j} W_j$.

d — a function of i which is a measure of the interval (such as the
number of instructions) between the i-th and (i+1)-th requests.

$\tau$ — a function of $N_e$ and $N_w$ specifying the time taken per unit
measure of the inter-request interval (hence $T(i,N_e,N_w) = d(i)\,\tau(N_e,N_w)$);
$\tau$ reflects the effect of multiprogramming on the rate of execution.

$\pi$ — a function of D, $N = (N_0, N_1, \ldots)$ and $W = (W_0, W_1, \ldots)$
characterizing the conflicts; it specifies the dependence of the
conflict probability p on D, N and W, i.e. $p = \pi(D,N,W)$.
Given  these  inputs,  the  task  is  to  determine  N  and  W,  and  hence  deduce 
the  values  of  various  performance  measures. 

In  this  paper,  we  shall  restrict  our  attention  to  the  case  where  no 
waiting  is  allowed.  There  are  two  reasons  for  this:  Firstly,  it  is 
easier  to  illustrate  the  approach  in  our  framework.  Secondly,  even  this 
simple  case  provides  some  useful  insight,  and  it  will  guide  us  in 
analyzing  the  waiting  case. 


3.  THE  NO  WAITING  CASE 


In  this  case,  a  transaction  requesting  a  lock  is  restarted  whenever 
there  is  a  conflict.  The  flow  diagram  has  the  simplified  form  in 
Figure  3.1. 


Figure  3.1.  The  No  Waiting  Case 


For convenience let d(i) be independent of i, say d(i) = 1 for all i.
Then, since $N_e = N$ and $N_w = 0$ in this model,

(Inter-Request Time)  $T = \tau(N)$.  (3.1)

Using Little's Law, $N_i = r_i T$ in steady state, where $r_i$ is the rate at
which transactions enter stage i. Hence, from Fig. 3.1,

(Little's Law)  $N_i = r_i T$  where  $r_i = \sum_{j=i}^{\infty} t_j + \sum_{j=i}^{\infty} a_j$,  i = 0,1,2,...  (3.2)

Since the number of transactions in the system is held constant, we have

(Conservation)  $N = \sum_{j=0}^{\infty} N_j$.  (3.3)

Now, $p_i = t_i / r_i$ or $t_i = p_i r_i$. By (3.2),

(Rate of Termination)  $t_i = \frac{1}{T}\, p_i N_i$,  i = 1,2,...  (3.4)

Similarly, since at each stage the rate of lock requests is $(1-p_i) r_i$, we have
$p = \frac{a_i}{(1-p_i)\, r_i}$, and thus

(Abort Rate)  $a_i = \frac{1}{T}\, p\,(1-p_i)\, N_i$,  i = 0,1,2,...  (3.5)

Since we are considering the case where there is no waiting for a lock,

(Probability of Conflict)  $p = \pi(D,N)$.  (3.6)

For arbitrary $\pi$ and $\tau$, p, and hence N, can be evaluated by solving
equations (3.1)-(3.6) (for example, by iteration).

To continue with our analysis, we shall henceforth assume there is
uniform access,* i.e.

$\pi(D,N) = \frac{1}{D} \sum_{j=1}^{\infty} j N_j$.  (3.7)

Fact 3.1: The probability of conflict is $p = 1 - q$, where q is a solution to $f(q) = 0$, with

$f(q) = \frac{\lambda \sum_{j=1}^{\infty} j c_j q^j}{(1-q) \sum_{j=0}^{\infty} c_j q^j} - 1$,  (3.8)

$\lambda = N/D$, $c_0 = 1$ and $c_i = \prod_{j=0}^{i-1} (1-p_j)$ for i = 1,2,... .

* If one assumes that a transaction does not request a lock it already
holds, then this expression should in fact be $\left(\sum_{j=1}^{\infty} j N_j - i\right)/(D-i)$ for a
transaction with i locks. The two expressions converge if $i \ll \sum_{j=1}^{\infty} j N_j$.
[TSG] considers non-uniform access.

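Proof. By (3.2),

$\frac{N_{i+1}}{N_i} = \frac{r_{i+1}}{r_i} = 1 - \frac{t_i + a_i}{r_i} = 1 - \frac{p_i r_i + p(1-p_i) r_i}{r_i} = (1-p)(1-p_i)$.

Hence, with $q = 1 - p$,

$N_{i+1} = q(1-p_i) N_i$

and

$N_i = N_0 \prod_{j=0}^{i-1} [q(1-p_j)] = c_i q^i N_0$  for i = 0,1,... .  (3.9)

Using (3.3), $N = \sum_{j=0}^{\infty} c_j q^j N_0$, so

$N_0 = \frac{N}{\sum_{j=0}^{\infty} c_j q^j}$.  (3.10)

Now, by (3.7),

$p = \frac{1}{D} \sum_{j=1}^{\infty} j c_j q^j N_0 = \frac{N}{D} \cdot \frac{\sum_{j=1}^{\infty} j c_j q^j}{\sum_{j=0}^{\infty} c_j q^j}$,

i.e.

$(1-q) = \lambda\, \frac{\sum_{j=1}^{\infty} j c_j q^j}{\sum_{j=0}^{\infty} c_j q^j}$,  where $\lambda = N/D$.

Hence the claim.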

Remark: For uniform access, the effects of N and D are expressed
through a single parameter $\lambda = N/D$, which we call the load.

The following fact indicates that f is well formed.


Fact 3.2: Given $\lambda$ and $c_j$, j = 1,2,..., f has exactly one zero in [0,1).


Proof. Let $g(q) = \sum_{j=1}^{\infty} j c_j q^j$ and

$h(q) = (1-q) \sum_{j=0}^{\infty} c_j q^j = \sum_{j=0}^{\infty} c_j q^j - \sum_{j=0}^{\infty} c_j q^{j+1} = 1 - \sum_{j=1}^{\infty} (c_{j-1} - c_j)\, q^j$.

Since $c_j = (1-p_{j-1})\, c_{j-1} \le c_{j-1}$, we have $c_{j-1} - c_j \ge 0$ and so h(q) decreases
as q increases, whereas g(q) increases as q increases. But $f(q) = \lambda \frac{g(q)}{h(q)} - 1$,
so f(q) increases as q increases. The claim now follows from
the observation that $f(0) = -1 < 0$, $\lim_{q \to 1^-} f(q) = +\infty$
and f is continuous on [0,1).
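Because f is increasing on [0,1) with f(0) = -1 and an unbounded limit at 1, its zero can be bracketed and found by bisection. The following is a minimal sketch under that reading of (3.8), not the authors' procedure (they report using MACSYMA). The names `coefficients`, `f` and `solve_q` are hypothetical, and the infinite sums are truncated to the finitely many stages supplied in `c`, which is exact whenever some $p_i = 1$.

```python
def coefficients(p_term, max_stage):
    """Return [c_0, ..., c_max_stage], with c_0 = 1 and c_i = prod_{j<i} (1 - p_j)."""
    c = [1.0]
    for j in range(max_stage):
        c.append(c[-1] * (1.0 - p_term(j)))
    return c


def f(q, lam, c):
    """Value of f(q) from (3.8), truncated to the stages listed in c."""
    numerator = lam * sum(j * cj * q**j for j, cj in enumerate(c))
    denominator = (1.0 - q) * sum(cj * q**j for j, cj in enumerate(c))
    return numerator / denominator - 1.0


def solve_q(lam, c, tol=1e-12):
    """Bisection on [0, 1): f is negative at 0 and positive near 1."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid, lam, c) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Hypothetical example: every transaction needs exactly k = 3 locks,
    # so p_i = 1 at i = k and 0 before it; load lambda = N/D = 0.5.
    k = 3
    c = coefficients(lambda i: 1.0 if i == k else 0.0, k + 1)
    q = solve_q(0.5, c)
    print("q =", q, "prob(conflict) =", 1.0 - q)
```

For a single lock per transaction (c = [1, 1]) the equation reduces to $q^2 + \lambda q - 1 = 0$, which gives a closed form to check the solver against.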


Furthermore, this zero behaves in a manner expected of it.

Fact 3.3: Given $c_j$, j = 1,2,..., prob(conflict) increases as $\lambda$ increases,
and $\lim_{\lambda \to 0^+} p = 0$, $\lim_{\lambda \to +\infty} p = 1$.

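Proof. For fixed q, f(q) increases as $\lambda$ increases. Hence the zero $q_\lambda$
in [0,1) must decrease as $\lambda$ increases since f is monotonic increasing
on [0,1). (See Fact 3.2 and Figure 3.2.) Since $p = 1 - q_\lambda$, p therefore
increases as $\lambda$ increases. Now

$0 < \lambda \sum_{j=1}^{\infty} j c_j q_\lambda^j < \lambda \sum_{j=1}^{\infty} j c_j$,  so  $\lim_{\lambda \to 0^+} \lambda \sum_{j=1}^{\infty} j c_j q_\lambda^j = 0$,

and

$\sum_{j=0}^{\infty} c_j q_\lambda^j = 1 + \sum_{j=1}^{\infty} c_j q_\lambda^j \ge 1$.

Since $f(q_\lambda) = 0$, we have

$0 < (1 - q_\lambda) = \frac{\lambda \sum_{j=1}^{\infty} j c_j q_\lambda^j}{\sum_{j=0}^{\infty} c_j q_\lambda^j} \le \lambda \sum_{j=1}^{\infty} j c_j q_\lambda^j$.

Therefore

$\lim_{\lambda \to 0^+} (1 - q_\lambda) \le \lim_{\lambda \to 0^+} \lambda \sum_{j=1}^{\infty} j c_j q_\lambda^j = 0$,  i.e.  $\lim_{\lambda \to 0^+} p = 0$.

Suppose $\lim_{\lambda \to +\infty} p < 1$. (The limit exists since p increases with $\lambda$ and is
bounded above by 1.) Then $\lim_{\lambda \to +\infty} q_\lambda = \varepsilon$ for some $0 < \varepsilon < 1$, and

$\lim_{\lambda \to +\infty} f(q_\lambda) = \lim_{\lambda \to +\infty} \frac{\lambda \sum_{j=1}^{\infty} j c_j \varepsilon^j - (1-\varepsilon) \sum_{j=0}^{\infty} c_j \varepsilon^j}{(1-\varepsilon) \sum_{j=0}^{\infty} c_j \varepsilon^j}$

$= \lim_{\lambda \to +\infty} \frac{1}{(1-\varepsilon) \sum_{j=0}^{\infty} c_j \varepsilon^j} \left[ \lambda \sum_{j=1}^{\infty} j c_j \varepsilon^j - 1 + \sum_{j=1}^{\infty} (c_{j-1} - c_j)\, \varepsilon^j \right]$  (see proof of Fact 3.2)

$\ge \lim_{\lambda \to +\infty} \frac{1}{(1-\varepsilon) \sum_{j=0}^{\infty} c_j \varepsilon^j} \left[ \lambda \sum_{j=1}^{\infty} j c_j \varepsilon^j - 1 \right]$

$= +\infty$, which contradicts the definition of $q_\lambda$.

Hence $\lim_{\lambda \to +\infty} p = 1$.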

Using the zero $q_\lambda$, one can compute N using (3.9) and (3.10). From
(3.4), the throughput is then

$t = \sum_{j=1}^{\infty} t_j = \frac{1}{T} \sum_{j=1}^{\infty} p_j N_j$  (3.11)

and, from (3.5), the abort rate is

$a = \sum_{j=0}^{\infty} a_j = \frac{p}{T} \left( \sum_{j=0}^{\infty} N_j - \sum_{j=0}^{\infty} p_j N_j \right) = \frac{p}{T} \left( N - \sum_{j=0}^{\infty} p_j N_j \right)$.  (3.12)

Other performance measures like the number of locks a transaction holds
when it restarts/terminates, response time of a transaction and
prob(transaction terminates without ever restarting) can also be computed.
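Once q is known, (3.9)-(3.12) are direct sums. The sketch below is an illustration only: the function names are hypothetical, q is taken as given rather than solved for, and the stage list is assumed finite. It also checks the flow balance implied by (3.2) at stage 0, namely that the rate of transactions entering stage 0, $N_0/T$, equals throughput plus abort rate.

```python
def stage_populations(q, c, n_total):
    """N_i = c_i q^i N_0 (3.9), with N_0 fixed by conservation (3.10)."""
    weights = [cj * q**j for j, cj in enumerate(c)]
    n0 = n_total / sum(weights)
    return [w * n0 for w in weights]


def throughput(pops, p_term, T):
    """t = (1/T) * sum_j p_j N_j, equation (3.11); p_0 = 0 so stage 0 adds nothing."""
    return sum(p_term(j) * nj for j, nj in enumerate(pops)) / T


def abort_rate(pops, p_term, q, T):
    """a = (p/T) * sum_j (1 - p_j) N_j with p = 1 - q, equation (3.12)."""
    p = 1.0 - q
    return p * sum((1.0 - p_term(j)) * nj for j, nj in enumerate(pops)) / T


if __name__ == "__main__":
    # Hypothetical numbers: k = 2 locks per transaction, q = 0.5, N = 7, T = 1.
    p_term = lambda j: 1.0 if j == 2 else 0.0
    pops = stage_populations(0.5, [1.0, 1.0, 1.0], 7.0)
    t = throughput(pops, p_term, 1.0)
    a = abort_rate(pops, p_term, 0.5, 1.0)
    # Flow balance at stage 0: r_0 = N_0 / T = t + a.
    print(pops, t, a, pops[0] / 1.0 == t + a)
```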




4. CASE WHERE ALL TRANSACTIONS REQUIRE k LOCKS


$t = \frac{N}{T} \cdot \frac{(1-q)\, q^k}{1 - q^{k+1}}$  (4.4)

$a = \frac{N}{T} \cdot \frac{(1-q)(1-q^k)}{1 - q^{k+1}}$  (4.5)


The zero in [0,1) of (4.2) is found with the help of MACSYMA* for
various values of $\lambda$. We shall compare the numerical solution to some
simulation results in the next section. First, let us analyze the asymptotic
behavior of the system as predicted by these formulae.

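Fact 4.1: For small $\lambda$ and k (i.e. $\lambda \ll 1$ and k < 10, say)

(i)  $p \approx \frac{kN}{2D}$

(ii)  $t \approx \frac{N}{(k+1)T}$

(iii)  $a \approx kpt$

Proof. By Fact 3.3, $q \approx 1$ for $\lambda \approx 0$, and hence $q^k \approx 1$ for small k.
Using this approximation on (4.2),

$0 = f(q) \approx \frac{\lambda}{1-q} \cdot \frac{\sum_{j=1}^{k} j}{\sum_{j=0}^{k} 1} - 1 = \frac{\lambda k}{2(1-q)} - 1$,

so

$p \approx \frac{\lambda k}{2} = \frac{k}{2} \cdot \frac{N}{D}$.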


* MACSYMA is a large symbolic manipulation program developed at the MIT
Laboratory for Computer Science and supported by the National Aeronautics
and Space Administration under grant NSG 1323, by the Office of Naval Research
under grant N00014-77-C-0641, by the U.S. Department of Energy under grant
ET-78-C-02-4687, and by the U.S. Air Force under grant F49620-79-C-020.


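For throughput, by (4.4),

$t = \frac{N}{T} \cdot \frac{q^k}{1 + q + \cdots + q^k} \approx \frac{N}{(k+1)T}$.

Now, (4.4) and (4.5) imply $\frac{a}{t} = \frac{1 - q^k}{q^k} = q^{-k} - 1$, so

$a = \left( (1-p)^{-k} - 1 \right) t \approx kpt$

since $p \approx 0$ and k is small.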


Hence, for small $\lambda$ and k, p and t are linear in N. Consequently
a is quadratic in N (but a/t is linear) for small $\lambda$ and k.

Graph 4.1 compares these asymptotes with exact solutions obtained by
solving for p using MACSYMA.
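The comparison between the small-load approximations of Fact 4.1 and the exact values can be reproduced numerically. The sketch below is illustrative only: it assumes that (4.2) is (3.8) specialized to $c_j = 1$ for $j \le k$ (the text of (4.2) is not reproduced here), it uses plain bisection where the paper used MACSYMA, and all function names are hypothetical.

```python
def f_k(q, lam, k):
    """Assumed form of (4.2): (3.8) with c_j = 1 for j <= k and 0 beyond."""
    s0 = sum(q**j for j in range(k + 1))
    s1 = sum(j * q**j for j in range(1, k + 1))
    return lam * s1 / ((1.0 - q) * s0) - 1.0


def solve_q(lam, k, tol=1e-12):
    """Bisection for the zero of f_k in [0, 1)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f_k(mid, lam, k) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def exact_measures(lam, k, D=1.0, T=1.0):
    """(p, t, a) from the zero of f_k together with (4.4) and (4.5)."""
    N = lam * D
    q = solve_q(lam, k)
    t = (N / T) * (1.0 - q) * q**k / (1.0 - q**(k + 1))
    a = (N / T) * (1.0 - q) * (1.0 - q**k) / (1.0 - q**(k + 1))
    return 1.0 - q, t, a


def small_load_measures(lam, k, D=1.0, T=1.0):
    """(p, t, a) from the approximations (i)-(iii) of Fact 4.1."""
    N = lam * D
    p = lam * k / 2.0
    t = N / ((k + 1) * T)
    return p, t, k * p * t


if __name__ == "__main__":
    for lam in (0.01, 0.05, 0.2):
        print(lam, exact_measures(lam, 3), small_load_measures(lam, 3))
```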

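Fact 4.2:

$\lim_{k \to \infty} p = \frac{1}{2} \left[ \sqrt{\lambda^2 + 4\lambda} - \lambda \right]$

Proof. Since $\lim_{k \to \infty} q^{k+1} = 0$ and $\frac{q}{(1-q)^2} = q + 2q^2 + \cdots + kq^k + \cdots$,

$\lim_{k \to \infty} f(q) = \frac{\lambda q}{(1-q)^2} - 1 = f_\infty(q)$, say.

The zeros of $f_\infty$ are given by

$(1-q)^2 - \lambda q = 0$  or  $q^2 - (2+\lambda) q + 1 = 0$.

They are

$q = \frac{1}{2} \left[ (2+\lambda) \pm \sqrt{(2+\lambda)^2 - 4} \right]$.

The zero in (0,1) is $q = \frac{1}{2} \left[ 2 + \lambda - \sqrt{\lambda^2 + 4\lambda} \right]$, so $p = 1 - q = \frac{1}{2} \left[ \sqrt{\lambda^2 + 4\lambda} - \lambda \right]$.

Graph 4.2 shows how this limit is reached for various $\lambda$.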
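The closed form in Fact 4.2 is easy to check against the finite-k equation. The following sketch again assumes that (4.2) is (3.8) with $c_j = 1$ for $j \le k$; `limit_p`, `f_k` and `solve_q` are hypothetical names, and bisection stands in for the symbolic solution.

```python
import math


def limit_p(lam):
    """Limiting conflict probability as k grows (Fact 4.2)."""
    return 0.5 * (math.sqrt(lam * lam + 4.0 * lam) - lam)


def f_k(q, lam, k):
    """Assumed form of (4.2) for transactions that all need k locks."""
    s0 = sum(q**j for j in range(k + 1))
    s1 = sum(j * q**j for j in range(1, k + 1))
    return lam * s1 / ((1.0 - q) * s0) - 1.0


def solve_q(lam, k, tol=1e-12):
    """Bisection for the zero of f_k in [0, 1)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f_k(mid, lam, k) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    lam = 0.9
    for k in (2, 5, 10, 50):
        print(k, 1.0 - solve_q(lam, k), limit_p(lam))
```

For $\lambda = 0.9$ the square root is exactly 2.1, so the limit is p = 0.6, which makes this load a convenient test value.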
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We  now  discuss  the  significance  of  these  results. 

Fact 4.1 confirms our intuition that, for small $\lambda$, the transactions
hardly interfere (i.e., seldom conflict) with one another. For, with little
interference, almost all transactions are expected to terminate without
restarts. The average number of locks held by each transaction is then
$\frac{k}{2}$, so that there are $\frac{k}{2} N$ locks in the system, giving $p \approx \frac{1}{D} \left( \frac{k}{2} N \right)$. Also, the
response time is (k+1)T since there are k+1 stages; hence $t \approx \frac{N}{(k+1)T}$.

Finally, if there is scant interference, then the throughput through each
stage is t, the abort rate at each of the first k stages is pt, and so
$a \approx kpt$. Note that we have derived these three approximations simply by
assuming there is little conflict of locks, but they agree with the
approximations in Fact 4.1, which are obtained by examining the solution to f(q) = 0.

Fact 4.2 and Graph 4.2 show that prob(conflict) is constant for large
enough k. Given $\lambda_0$, call the least k for which prob(conflict) is
within 1% of the limiting value the lock limit for $\lambda_0$. That prob(conflict)
is constant must mean that increasing the number (k) of stages makes no
difference to the number of locks held. This, in turn, must be because the
number of transactions at the last stage is negligible. Thus, for a given
number of stages $k_0$, call the value of $\lambda$ for which $N_{k_0}/N = 1\%$ the
saturation point for $k_0$. We expect that, for a given $\lambda_0$, if k is larger
than its lock limit, then $\lambda_0$ must be close to, if not larger than, the
saturation point for k. For example, for $\lambda_0 = 0.9$, the lock limit is 5
(see Graph 4.2), and the saturation point for k = 5 is between 0.5 and 0.9
(see Graph 4.3).
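Both definitions can be evaluated directly from the model. The sketch below makes several assumptions that are not in the text: (4.2) is taken to be (3.8) with $c_j = 1$ for $j \le k$; "within 1%" is read as an absolute difference of 0.01 in probability (the text does not say whether the tolerance is relative or absolute); the last-stage fraction $N_k/N$ is assumed to decrease with the load, so that bisection on $\lambda$ applies. All function names and search bounds are hypothetical.

```python
import math


def solve_q(lam, k, tol=1e-12):
    """Zero in [0, 1) of the assumed (4.2), found by bisection."""
    def f(q):
        s0 = sum(q**j for j in range(k + 1))
        s1 = sum(j * q**j for j in range(1, k + 1))
        return lam * s1 / ((1.0 - q) * s0) - 1.0

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def limit_p(lam):
    """Limiting conflict probability from Fact 4.2."""
    return 0.5 * (math.sqrt(lam * lam + 4.0 * lam) - lam)


def lock_limit(lam, tol=0.01, k_max=200):
    """Least k whose conflict probability is within tol (absolute) of the limit."""
    target = limit_p(lam)
    for k in range(1, k_max + 1):
        if abs((1.0 - solve_q(lam, k)) - target) <= tol:
            return k
    return None


def last_stage_fraction(lam, k):
    """N_k / N = q^k (1 - q) / (1 - q^(k+1)), from (3.9) and (3.10)."""
    q = solve_q(lam, k)
    return q**k * (1.0 - q) / (1.0 - q**(k + 1))


def saturation_point(k, fraction=0.01, lam_lo=1e-3, lam_hi=50.0, tol=1e-6):
    """Load at which N_k / N falls to `fraction`, by bisection on the load."""
    lo, hi = lam_lo, lam_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if last_stage_fraction(mid, k) > fraction:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print("lock limit for load 0.9:", lock_limit(0.9))
    print("saturation point for k = 5:", saturation_point(5))
```

Under the absolute-tolerance reading, the values this produces can be compared with the figures read off Graphs 4.2 and 4.3 in the text.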

The model also captures the phenomenon of thrashing due to concurrency
control and resource contention. Graph 4.4a is a plot of t for $\tau(N) = 1$
(i.e. no resource contention). We see that t decreases for large N due
to the restarts caused by the concurrency control. Call the maximum the
CC-thrashing point. On the other hand, suppose there is no concurrency
control. A typical $\tau$ is an approximately exponential function that gives
a throughput curve like that in Graph 4.4b [DKLPS]. Again, t decreases
for large N, but this is the effect of resource contention; call the maximum
the RC-thrashing point. Graph 4.4c shows the combined effect of concurrency
control and resource contention.

For a given k, one would expect the CC-thrashing point to be reached
even before the saturation point: from Graph 4.4a, the CC-thrashing point
for k = 5 is at $\lambda \approx 0.25$, whereas the saturation point is greater than
0.5 (Graph 4.3).
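With $\tau(N) = 1$ and D fixed, (4.4) gives a throughput proportional to $\lambda (1-q) q^k / (1 - q^{k+1})$, so the CC-thrashing point can be located by scanning the load. The sketch below is a rough illustration: it assumes (4.2) is (3.8) with $c_j = 1$ for $j \le k$, uses a coarse grid rather than the polynomial fit mentioned in Section 5, and all names are hypothetical.

```python
def solve_q(lam, k, tol=1e-12):
    """Zero in [0, 1) of the assumed (4.2), found by bisection."""
    def f(q):
        s0 = sum(q**j for j in range(k + 1))
        s1 = sum(j * q**j for j in range(1, k + 1))
        return lam * s1 / ((1.0 - q) * s0) - 1.0

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def throughput_factor(lam, k):
    """Throughput (4.4) in units of D/T, i.e. with N = lam * D and T = 1."""
    q = solve_q(lam, k)
    return lam * (1.0 - q) * q**k / (1.0 - q**(k + 1))


def cc_thrashing_point(k, lam_grid):
    """Grid point at which the throughput factor is largest."""
    return max(lam_grid, key=lambda lam: throughput_factor(lam, k))


if __name__ == "__main__":
    grid = [i / 100.0 for i in range(1, 101)]
    lam_star = cc_thrashing_point(5, grid)
    q_star = solve_q(lam_star, 5)
    # Aborts per completed transaction at the peak: a/t = (1 - q^k) / q^k.
    print(lam_star, throughput_factor(lam_star, 5), (1.0 - q_star**5) / q_star**5)
```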


5.  VALIDATION  OF  THE  MODEL 


The  values  of  p  as  a  function  of  k  and  X,  and  as  computed  by 
MACSYMA  using  (4.2),  are  tabulated  in  Table  5.1.  Tables  5.2  to  5.6  give 
various  performance  measures  computed  with  these  values. 

To validate the model, we use a restricted version of the two phase
locking simulation program written by Oded Shmueli. In the simulation runs,
D = 100, k varies from 3 to 6, and $\lambda$ from 0.1 to 0.9. In each run,
30000 requests are generated, and the average performance measures computed
for the last 10000 requests (to avoid transient effects).

Table 5.7 and Graph 5.1 compare the simulation results with the figures
predicted by the model. The agreement is remarkable. Note from the table
that the ratio a/t varies from 0.5 to 230.0, so we have simulated a high
level of conflicts in some of the experiments. And it is precisely in a
situation of intense conflict that we expect deficiencies in a model to show
themselves. For this reason, the results in Table 5.1 are reassuring.

A scrutiny of Tables 5.3, 5.5 and 5.6 reveals something rather curious.
By fitting the data with a polynomial, one can estimate from Table 5.3 the
CC-thrashing points for various k. Using these estimates, one can find the
corresponding r and s values in Tables 5.5 and 5.6 by interpolation
(again fitting the data with a polynomial).

Table 5.8 shows the results of this exercise: at the CC-thrashing point
for a wide range of values of k, r is about 7 and s about 0.3. Why this
is so is not yet known. However, from the value of r, one sees that, to
achieve maximum throughput, one would have to accept the cost of 7 aborts per
successful completion. Looking at it in another way, to reach the CC-thrashing
point, one would have to push the system very hard.


6. TRANSACTIONS WITH MEMORYLESS BEHAVIOR


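Suppose after acquiring each lock, a transaction terminates with
probability $\theta$ (it does not 'remember' how many locks it already has). In
this case (the flow diagram is as in Figure 3.1), $p_i = \theta$ for $i \ge 1$. From
(3.8), we have

$f(q) = \frac{\lambda}{1-q} \cdot \frac{\sum_{j=1}^{\infty} j (1-\theta)^{j-1} q^j}{1 + \sum_{j=1}^{\infty} (1-\theta)^{j-1} q^j} - 1$  since $c_0 = 1$ and $c_j = (1-\theta)^{j-1}$ for $j \ge 1$

$= \frac{\lambda}{1-q} \cdot \frac{q / (1 - (1-\theta) q)^2}{1 + q / (1 - (1-\theta) q)} - 1$

$= \frac{\lambda q}{(1-q)(1+\theta q)(1 - (1-\theta) q)} - 1$.  (6.1)

As for throughput,

$t = \frac{1}{T} \sum_{j=1}^{\infty} \theta N_j$  by (3.11)

$= \frac{\theta}{T} \sum_{j=1}^{\infty} c_j q^j N_0$  by (3.9)

$= \frac{\theta N}{T} \cdot \frac{\sum_{j=1}^{\infty} c_j q^j}{\sum_{j=0}^{\infty} c_j q^j}$  by (3.10)

$= \frac{\theta N}{T} \left[ \frac{q / (1 - (1-\theta) q)}{1 + q / (1 - (1-\theta) q)} \right]$

$= \frac{N}{T}\, \theta \left[ \frac{q}{1 + \theta q} \right]$.  (6.2)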


Fact 6.1: If $\theta$, T and D are constants, then t increases monotonically
with N for large N.

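Proof. Recall from Fact 3.3 that $\lim_{\lambda \to +\infty} p = 1$ or $\lim_{\lambda \to +\infty} q = 0$. Hence for fixed
D and large N, $q \approx 0$. From (6.1),

$0 = f(q) = \frac{\lambda q}{(1-q)(1+\theta q)(1 - (1-\theta) q)} - 1 \approx \frac{\lambda q}{1 - 2(1-\theta) q} - 1$,

so

$q \approx \frac{1}{\lambda + 2(1-\theta)}$.

Substituting this into (6.2),

$t \approx \frac{N}{T}\, \theta \left[ \frac{1}{\lambda + 2(1-\theta) + \theta} \right] = \frac{\theta}{T}\, D \left[ \frac{N}{N + (2-\theta) D} \right] = \frac{\theta}{T}\, D \left[ 1 - \frac{(2-\theta) D}{N + (2-\theta) D} \right]$.

Since $2 - \theta > 0$, t increases monotonically with N (and approaches the limit
$\frac{\theta}{T} D$).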


This fact is counter-intuitive, for one expects that if transactions
behave reasonably, then the throughput should decrease when N is large
because of too many restarts, even if the effect of resource contention is
ignored. We conclude, from the above result, that the memoryless assumption
does not adequately model transaction behavior.
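The absence of CC-thrashing under the memoryless assumption can be seen numerically from (6.1) and (6.2). The sketch below solves (6.1) by bisection and evaluates (6.2) for growing N. The function names and the particular values D = 100, $\theta = 0.2$ and T = 1 are assumptions chosen only for illustration.

```python
def solve_q_memoryless(lam, theta, tol=1e-12):
    """Zero in [0, 1) of f in (6.1), found by bisection."""
    def f(q):
        return lam * q / ((1.0 - q) * (1.0 + theta * q) * (1.0 - (1.0 - theta) * q)) - 1.0

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def throughput_memoryless(N, D, theta, T=1.0):
    """Throughput from (6.2): t = (N/T) * theta * q / (1 + theta * q)."""
    q = solve_q_memoryless(N / D, theta)
    return (N / T) * theta * q / (1.0 + theta * q)


if __name__ == "__main__":
    D, theta = 100.0, 0.2
    for N in (10, 100, 1000, 10000):
        # The throughput keeps rising towards the bound theta * D / T = 20.
        print(N, throughput_memoryless(N, D, theta))
```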


Figure 7.2. Figure 7.1 as a sum of two Figure 4.1's


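Note that the probability of conflict is the same in both transaction
classes. By (3.7),

$p = \frac{1}{D} \left[ \sum_{j=1}^{k} j N_j^{(k)} + \sum_{j=1}^{\ell} j N_j^{(\ell)} \right]$,

so, by (3.9) and (4.1),

$(1-q) = \frac{1}{D} \left[ \sum_{j=1}^{k} j q^j N_0^{(k)} + \sum_{j=1}^{\ell} j q^j N_0^{(\ell)} \right]$

$= \frac{N}{D} \left[ \frac{N^{(k)}}{N} \cdot \frac{N_0^{(k)}}{N^{(k)}} \sum_{j=1}^{k} j q^j + \frac{N^{(\ell)}}{N} \cdot \frac{N_0^{(\ell)}}{N^{(\ell)}} \sum_{j=1}^{\ell} j q^j \right]$

$= \lambda \left[ b\, \frac{\sum_{j=1}^{k} j q^j}{\sum_{j=0}^{k} q^j} + (1-b)\, \frac{\sum_{j=1}^{\ell} j q^j}{\sum_{j=0}^{\ell} q^j} \right]$  by (3.10).

Hence q is a zero of g, where

$g(q) = \frac{\lambda}{1-q} \left[ b\, \frac{\sum_{j=1}^{k} j q^j}{\sum_{j=0}^{k} q^j} + (1-b)\, \frac{\sum_{j=1}^{\ell} j q^j}{\sum_{j=0}^{\ell} q^j} \right] - 1$.  (7.1)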

Proof. Recall from the proof of Fact 7.1 that g is continuous and
monotonic increasing on [0,1). The proof that p increases as $\lambda$ increases
is therefore the same as that in Fact 3.3.

Let $q_\alpha$ be the zero of $f_\alpha$ in [0,1), where $\alpha = k, \ell$ and $f_\alpha$ is
as in the preceding proof. Since $g = b f_k + (1-b) f_\ell$ where $f_k$ and $f_\ell$
are continuous and monotonic, the zero $q_\lambda$ in [0,1) of g must lie
between $q_k$ and $q_\ell$. Hence p lies between $p_k$ and $p_\ell$, where
$p_\alpha = 1 - q_\alpha$, $\alpha = k, \ell$. The limiting behavior of $p_k$ and $p_\ell$ is given by
Fact 7.1, i.e. $\lim_{\lambda \to 0^+} p_\alpha = 0$ and $\lim_{\lambda \to +\infty} p_\alpha = 1$ for $\alpha = k, \ell$, and it follows
that p has the same limits. □

The various performance measures are as follows:


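From (4.4),

$t^{(k)} = \frac{N^{(k)}}{T} \cdot \frac{(1-q)\, q^k}{1 - q^{k+1}} = b\, \frac{N}{T} \cdot \frac{(1-q)\, q^k}{1 - q^{k+1}}$,  $t^{(\ell)} = (1-b)\, \frac{N}{T} \cdot \frac{(1-q)\, q^\ell}{1 - q^{\ell+1}}$.  (7.2)

From (4.5),

$a^{(k)} = b\, \frac{N}{T} \cdot \frac{(1-q)(1-q^k)}{1 - q^{k+1}}$,  $a^{(\ell)} = (1-b)\, \frac{N}{T} \cdot \frac{(1-q)(1-q^\ell)}{1 - q^{\ell+1}}$.  (7.3)

Fact 7.3: For small $\lambda$, k and $\ell$,

(i)  $p \approx \frac{\lambda}{2} \left[ bk + (1-b)\ell \right]$

(ii)  $t \approx \left( \frac{b}{k+1} + \frac{1-b}{\ell+1} \right) \frac{N}{T}$,  where $t = t^{(k)} + t^{(\ell)}$

(iii)  $a \approx \left( \frac{bk}{k+1} + \frac{(1-b)\ell}{\ell+1} \right) \frac{pN}{T}$,  where $a = a^{(k)} + a^{(\ell)}$.

Proof. By Fact 7.2, for small $\lambda$, k and $\ell$, $q^j \approx 1$ for $1 \le j \le \ell$. Thus
from (7.1),

$0 \approx \frac{\lambda}{p} \left[ b\, \frac{k}{2} + (1-b)\, \frac{\ell}{2} \right] - 1$

and so

$p \approx \frac{\lambda}{2} \left[ bk + (1-b)\ell \right]$.

Similarly, by (7.2),

$t^{(k)} = \frac{N}{T}\, b\, \frac{q^k}{1 + q + \cdots + q^k} \approx \frac{N}{T}\, b\, \frac{1}{k+1}$,  $t^{(\ell)} \approx \frac{N}{T}\, (1-b)\, \frac{1}{\ell+1}$,

and by (7.3),

$a^{(k)} = \frac{N}{T}\, b\, \frac{(1-q)(1 + q + \cdots + q^{k-1})}{1 + q + \cdots + q^k} \approx \frac{N}{T}\, b\, \frac{pk}{k+1}$,

and likewise for $a^{(\ell)}$. Hence the claim in (ii) and (iii). □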

Once again, our intuition about the behavior of the system when $\lambda$ is
small is confirmed, for we can derive the approximations in Fact 7.3 with
intuitive arguments much as is done in Section 4. For example, few
conflicts are expected and most transactions should complete. The average
number of locks held by a length k transaction would then be k/2, and
that for a length $\ell$ transaction $\ell/2$. The average number of locks in the
system would thus be

$N \left[ b\, \frac{k}{2} + (1-b)\, \frac{\ell}{2} \right]$,  so that  $p \approx \frac{N}{D} \left[ b\, \frac{k}{2} + (1-b)\, \frac{\ell}{2} \right]$,

as in (i) above.

When considering two transaction classes, the principal interest is in
how their interaction affects the performance of each. In particular, we
consider now the contribution of the shorter transaction to the throughput
and abort rate.


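Fact 7.4: (i) $\frac{t^{(k)}}{t}$ increases as $\lambda$ increases, and

$\lim_{\lambda \to 0} \frac{t^{(k)}}{t} = \frac{b}{b + (1-b)\, \frac{k+1}{\ell+1}}$.

(ii) $\frac{a^{(k)}}{a}$ increases as $\lambda$ increases, and

$\lim_{\lambda \to 0} \frac{a^{(k)}}{a} = \frac{b}{b + (1-b)\, \frac{\ell}{\ell+1} \cdot \frac{k+1}{k}}$.

Proof. (i) From (7.2),

$\frac{t^{(k)}}{t} = b \Big/ \left( b + (1-b)\, \frac{1 - q^{k+1}}{1 - q^{\ell+1}} \cdot \frac{q^\ell}{q^k} \right)$.

It is easy to show that $\frac{1 - q^{k+1}}{1 - q^{\ell+1}} \cdot \frac{q^\ell}{q^k}$ decreases when q decreases in
[0,1), or as $\lambda$ increases (Fact 7.2), so $\frac{t^{(k)}}{t}$ is an increasing function
of $\lambda$.

Now for small $\lambda$, $q \approx 1$ (Fact 7.2), so

$\frac{t^{(k)}}{t} = b \Big/ \left( b + (1-b)\, \frac{1 + q + \cdots + q^k}{1 + q + \cdots + q^\ell}\, q^{\ell-k} \right) \approx b \Big/ \left( b + (1-b)\, \frac{k+1}{\ell+1} \right)$.

For large $\lambda$, $q \approx 0$, so $\frac{t^{(k)}}{t} \approx b / (b + (1-b)\, q^{\ell-k}) \approx 1$ since $\ell > k$.

(ii) From (7.3),

$\frac{a^{(k)}}{a} = b \Big/ \left( b + (1-b)\, \frac{1 - q^\ell}{1 - q^{\ell+1}} \cdot \frac{1 - q^{k+1}}{1 - q^k} \right)$.

It is again a straightforward, if tedious, exercise to show that $\frac{a^{(k)}}{a}$
increases when $\lambda$ increases (for k > 2).

When $\lambda \approx 0$, we have $q \approx 1$ and

$\frac{a^{(k)}}{a} = b \Big/ \left( b + (1-b)\, \frac{1 + q + \cdots + q^{\ell-1}}{1 + q + \cdots + q^\ell} \cdot \frac{1 + q + \cdots + q^k}{1 + q + \cdots + q^{k-1}} \right) \approx b \Big/ \left( b + (1-b)\, \frac{\ell}{\ell+1} \cdot \frac{k+1}{k} \right)$.

For large $\lambda$, $q \approx 0$, so

$\frac{a^{(k)}}{a} \approx b / (b + (1-b)) = b$.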


Since transactions requiring $\ell$ locks continue to suffer restarts
after the (k+1)-th stage, we expect $a^{(k)}/a < b$; Fact 7.4 gives

$\frac{b}{b + (1-b)\, \frac{\ell}{\ell+1} \cdot \frac{k+1}{k}} < \frac{a^{(k)}}{a} < b$.

Furthermore, the lower bound is close to b for a wide range of values of
k and $\ell$, so that the proportion of restarts is highly insensitive to the
number of locks required by each class. Similarly, because more locks are
required, transactions requiring $\ell$ locks contribute less to the throughput.
Fact 7.4 gives

$\frac{t^{(k)}}{t} > \frac{b}{b + (1-b)\, \frac{k+1}{\ell+1}}$.

For high enough load $\lambda$, insufficient numbers make it through the extra
stages, and $\frac{t^{(k)}}{t} \approx 1$.
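The two-class results can be explored numerically by solving g(q) = 0 and forming the class shares from (7.2) and (7.3). This sketch assumes the form of g obtained from (3.7), (3.9) and (3.10) for two classes needing k and $\ell$ locks, with b taken to be the fraction of transactions in the k-lock class. The function names and the example values are hypothetical.

```python
def mean_locks(q, k):
    """sum_{j=1..k} j q^j / sum_{j=0..k} q^j for a class needing k locks."""
    s0 = sum(q**j for j in range(k + 1))
    s1 = sum(j * q**j for j in range(1, k + 1))
    return s1 / s0


def solve_q_two_class(lam, b, k, l, tol=1e-12):
    """Zero in [0, 1) of the assumed two-class function g, by bisection."""
    def g(q):
        mix = b * mean_locks(q, k) + (1.0 - b) * mean_locks(q, l)
        return lam * mix / (1.0 - q) - 1.0

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def class_shares(q, b, k, l):
    """Shares t^(k)/t and a^(k)/a; the common factor (N/T)(1 - q) cancels."""
    tk = b * q**k / (1.0 - q**(k + 1))
    tl = (1.0 - b) * q**l / (1.0 - q**(l + 1))
    ak = b * (1.0 - q**k) / (1.0 - q**(k + 1))
    al = (1.0 - b) * (1.0 - q**l) / (1.0 - q**(l + 1))
    return tk / (tk + tl), ak / (ak + al)


if __name__ == "__main__":
    b, k, l = 0.5, 3, 6
    for lam in (0.05, 0.5, 2.0):
        q = solve_q_two_class(lam, b, k, l)
        print(lam, 1.0 - q, class_shares(q, b, k, l))
```

In this example the abort share should stay just below b while the throughput share of the shorter class moves towards 1 as the load grows, which is the behavior the discussion above describes.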

Lessons learnt from Section 4 are also applicable here. For example,
given $\lambda$ and b, k and $\ell$ should not be greater than the lock limits
for $b\lambda$ and $(1-b)\lambda$, respectively. Conversely, for given k and $\ell$, $\lambda$
and b must be chosen so that $b\lambda$ and $(1-b)\lambda$ are less than the
saturation points of k and $\ell$, respectively, or there will be no
throughput to speak of for one class or the other. If $\lambda$ is increased, whether
the CC-thrashing point for a particular class is reached first depends on


8. CONCLUSIONS

We  draw  two  sets  of  conclusions,  one  concerning  what  the  model  tells 
us  of  exclusive  locking  without  waiting,  and  the  other  about  the  model 
itself. 

If  we  believe  that  this  model  has  captured  the  essential  features  of 
the  locking  policy  studied,  then  it  reveals  the  following  results,  assuming 
uniform  access: 

(R1) The effect of the number of transactions on prob(conflict) is
indistinguishable from that of granularity.

(R2) For small loads, the probability of conflict, throughput and
ratio of abort rate to throughput are linear in N.

(R3) For throughput to be significant, transactions cannot request
more locks than the lock limit for a given load and, conversely, the load
should not exceed the saturation point for a given number of locks required.
We expect the concepts of lock limit and saturation point to endure even
if there is nonuniform access, when the model is extended to other locking
policies, and indeed for any reasonable model for locking.

(R4) Assuming a memoryless lock request behavior leads to increasing
throughput even for large N (if resource contention is disregarded), and
is unreasonable. (Hence some standard queueing models may be bad.)

(R5) For a system with two transaction classes, the one requiring
fewer locks contributes more to the throughput and, in fact, dominates under
high loads, whilst the ratio of aborts is essentially just the ratio of
transactions.

As to the desirability of the model itself, the model has the
following in its favor:

(M1) It is simple, and makes no assumptions about probability
distributions, etc.

(M2) It cleanly separates the issues of resource contention and
concurrency control, thus allowing one to study the effect of each on the system.

(M3)  It  is  a  'high-level'  model.  The  model  allows  us  to  ask  questions 
about  a  wide  range  of  performance  characteristics,  and  not  just  preliminary 
measures  such  as  the  probability  of  conflict.  In  fact,  the  model  accepts 
as  input  the  function  t  specifying  prob (conflict)  for  different  system 
states.  Though  it  is  assumed  that  there  is  uniform  access  in  most  of  this 
paper,  the  assumption  is  not  necessary  for  the  use  of  the  model — one  can 
perform  a  numerical  analysis  based  on  a  n  that  is  empirically  obtained 
for  the  particular  system  under  study.  (Restricted  forms  of  non-uniform 
access  can  also  be  treated  analytically  ITSG) . )  Similarly,  the  model 
accepts  any  T  that  may  be  appropriate  for  the  system  concerned. 

(M4)  It  is  flexible.  It  is  for  convenience,  for  instance,  that  T 
is  assumed  the  same  for  different  stages;  if  it  is  different,  it  can  be 
handled  in  a  way  much  as  is  done  for  p^.  We  have  also  seen  how  the  model 
can  handle  transactions  of  indeterminate  length,  and  multiple  transaction 
classes.  [TSG] treats a system with queries (which share locks) and
updates (which require exclusive locks).

(M5)  It  has  been  validated  for  the  case  of  all  transactions 
requiring  the  same  number  of  locks. 

(M6)  It  can  be  extended.  Work  is  now  in  progress  on  the  waiting 
case.  It  is  obvious  that  the  flow  diagram  in  Figure  1  is  equally 
applicable  to  timestamp  ordering  algorithms,  and  can  be  extended  to  more 
sophisticated concurrency control algorithms, such as those that abort
waiting transactions.

This paper contains but the first of what this model and its
extensions have to offer.
APPENDIX 2:  Tables


Table 5.1

Probability of conflict $p = 1 - q$, where $q$ is the zero in $[0,1)$ of
$f$ in (4.2).

Table 5.2

Unnormalized throughput
$$t = \frac{D}{T}\,\lambda\,\frac{(1-q)\,q^{k}}{1-q^{k+1}}$$
(see (4.4)).  Only the factor $\lambda\,\frac{(1-q)\,q^{k}}{1-q^{k+1}}$ is tabulated.


Table 5.3

Normalized throughput $t' = (k+1)\,t$; only the factor
$(k+1)\,\lambda\,\frac{(1-q)\,q^{k}}{1-q^{k+1}}$ is tabulated.

Without concurrency control, the response time of a transaction is
$(k+1)T$, and the throughput is $N/((k+1)T)$.  Hence the drop in throughput
in Table 5.2 when $k$ gets larger is partly due to just the time it
takes for a transaction to accumulate its locks.  Table 5.3 factors
this out, so as to make the effect of aborts on throughput easier
to discern.


Table 5.4

Abort rate
$$a = \frac{D}{T}\,\lambda\,\frac{(1-q)\,(1-q^{k})}{1-q^{k+1}}$$
(see (4.5)).  Only the factor $\lambda\,\frac{(1-q)\,(1-q^{k})}{1-q^{k+1}}$ is tabulated.


Table 5.5

Number of aborts per transaction.

Let $u$ = prob(a transaction completes without aborting).  Then $u = q^{k}$
and the expected number of aborts before completion is
$$\sum_{j=1}^{\infty} j\,(1-u)^{j}\,u \;=\; (1-u)\,u \sum_{j=1}^{\infty} j\,(1-u)^{j-1} \;=\; \frac{(1-u)\,u}{\bigl(1-(1-u)\bigr)^{2}} \;=\; \frac{1-q^{k}}{q^{k}},$$
which is also the ratio of abort rate to throughput (see (4.4) and (4.5)).
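The tabulated factors of Tables 5.2 to 5.5 can be evaluated directly once $q$ is known. The following is a minimal sketch, not the authors' program: it takes $q$ as an input (for instance $q = 1 - p$ with $p$ read from Table 5.1) because the function $f$ of (4.2) is not reproduced in this appendix, and the function names are illustrative assumptions.

```python
def throughput_factor(lam, q, k):
    """Factor tabulated in Table 5.2: lam * (1 - q) * q**k / (1 - q**(k + 1))."""
    return lam * (1 - q) * q**k / (1 - q ** (k + 1))


def normalized_throughput_factor(lam, q, k):
    """Factor tabulated in Table 5.3: (k + 1) times the Table 5.2 factor."""
    return (k + 1) * throughput_factor(lam, q, k)


def abort_rate_factor(lam, q, k):
    """Factor tabulated in Table 5.4: lam * (1 - q) * (1 - q**k) / (1 - q**(k + 1))."""
    return lam * (1 - q) * (1 - q**k) / (1 - q ** (k + 1))


def aborts_per_transaction(q, k):
    """Table 5.5: (1 - u) / u with u = q**k, the chance of finishing without an abort."""
    u = q**k
    return (1 - u) / u


if __name__ == "__main__":
    # Illustrative inputs: k = 3, lam = 0.1, and p = 0.1327 taken from Table 5.1.
    lam, k, q = 0.1, 3, 1 - 0.1327
    print(round(throughput_factor(lam, q, k), 4))             # compare with Table 5.2
    print(round(normalized_throughput_factor(lam, q, k), 4))  # compare with Table 5.3
    print(round(abort_rate_factor(lam, q, k), 4))             # compare with Table 5.4
    print(aborts_per_transaction(q, k))                       # compare with Table 5.5
```

Because the four-digit $p$ from Table 5.1 is itself rounded, the last digit of a recomputed entry can differ slightly from the printed tables.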


Table 5.6

Let $s = \frac{1}{k}$ (number of locks an average transaction holds).
Number of locks in the system $= Dp$ (see (3.7)), so
$$s = \frac{Dp}{N}\cdot\frac{1}{k} = \frac{p}{k\lambda}.$$
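As a quick cross-check between Tables 5.1 and 5.6, the relation $s = p/(k\lambda)$ can be evaluated for a single entry. This is a small sketch under the assumption that $\lambda = N/D$, which is what the formulas for Table 5.7 imply; the function name is hypothetical.

```python
def relative_locks_held(p, k, lam):
    """s = p / (k * lam): average locks held per transaction, as a fraction of k."""
    return p / (k * lam)


if __name__ == "__main__":
    # k = 3, lam = 0.1, p = 0.1327 from Table 5.1; Table 5.6 lists 0.4423 here.
    print(round(relative_locks_held(0.1327, 3, 0.1), 4))
```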


Table 5.7

In the simulation, $T = Nh$, where $h$ is the average time it takes for
the scheduler to handle one request.  Hence
$$t = \frac{1}{h}\,\frac{(1-q)\,q^{k}}{1-q^{k+1}} \qquad\text{and}\qquad a = \frac{1}{h}\,\frac{(1-q)\,(1-q^{k})}{1-q^{k+1}}.$$
The throughput and abort rate are given in the simulation with $h$ as the
time unit.  The predicted values in this table are therefore given in the
same time unit, i.e.,
$$t = \frac{(1-q)\,q^{k}}{1-q^{k+1}} \qquad\text{and}\qquad a = \frac{(1-q)\,(1-q^{k})}{1-q^{k+1}}.$$


Table 5.8

The r and s values at the CC-thrashing point for various values of k.


Table 5.1

Probability of conflict p
(λ is formed by the row prefix followed by the column digit, e.g. row .0, column 3 gives λ = 0.03)

k = 3
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0015   0.0030   0.0045   0.0060   0.0074   0.0089   0.0104   0.0119   0.0133
 .0     0.0148   0.0291   0.0431   0.0575   0.0706   0.0834   0.0958   0.1087   0.1205
 .      0.1327   0.2343   0.3141   0.3769   0.4202   0.4706   0.5059   0.5364   0.5627

k = 4
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0020   0.0040   0.0060   0.0079   0.0099   0.0119   0.0130   0.0157   0.0177
 .0     0.0196   0.0385   0.0566   0.0741   0.0909   0.1071   0.1220   0.1364   0.1511
 .      0.1646   0.2764   0.3561   0.4159   0.4629   0.5012   0.5334   0.5606   0.5840

k = 5
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0025   0.0050   0.0074   0.0099   0.0123   0.0148   0.0171   0.0196   0.0219
 .0     0.0244   0.0476   0.0689   0.0093   0.1007   0.1266   0.1438   0.1604   0.1756
 .      0.1903   0.3041   0.3804   0.4366   0.4805   0.5160   0.5457   0.5710   0.5930

k = 6
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0030   0.0060   0.0009   0.0119   0.0148   0.0176   0.0204   0.0232   0.0260
 .0     0.0291   0.0557   0.0800   0.1031   0.1236   0.1438   0.1618   0.1790   0.1948
 .      0.2101   0.3220   0.3950   0.4401   0.4895   0.5231   0.5514   0.5757   0.5969

k = 7
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0035   0.0070   0.0103   0.0137   0.0170   0.0204   0.0236   0.0269   0.0301
 .0     0.0329   0.0628   0.0901   0.1150   0.1372   0.1575   0.1763   0.1935   0.2101
 .      0.2248   0.3342   0.4041   0.4547   0.4944   0.5267   0.5542   0.5779   0.5986

k = 8
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0040   0.0079   0.0118   0.0156   0.0193   0.0231   0.0268   0.0304   0.0338
 .0     0.0375   0.0706   0.0991   0.1251   0.148?   0.1694   0.1883   0.2057   0.2212
 .      0.2361   0.?421   0.4093   0.4583   0.4970   0.5290   0.5556   0.5709   0.59?4

k = 9
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0045   0.0000   0.0131   0.0174   0.0216   0.0257   0.0290   0.0337   0.0376
 .0     0.0412   0.0766   0.1071   0.1342   0.1575   0.1703   0.1974   0.2145   0.2302
 .      0.2447   0.3473   0.4125   0.4603   0.4902   0.5296   0.5563   0.5793   0.5997

k = 10
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0050   0.0098   0.0146   0.0192   0.0230   0.0203   0.0374   0.0370   0.0412
 .0     0.0453   0.0829   0.1143   0.1416   0.1653   0.1857   0.2045   0.2212   0.2366
 .      0.2509   0.3511   0.4145   0.4615   0.4990   0.5301   0.5565   0.5797   0.5998

k = 11
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0055   0.0100   0.0159   0.0210   0.0259   0.0307   0.0355   0.0400   0.0446
 .0     0.0494   0.0884   0.1205   0.1402   0.1715   0.1922   0.2101   0.2266   0.2418
 .      0.2554   0.3532   0.4159   0.4624   0.4995   0.5303   0.5567   0.5797   0.6000

k = 12
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0060   0.0117   0.0173   0.0227   0.0200   0.0332   0.0302   0.0431   0.0479
 .0     0.0521   0.0934   0.1259   0.1533   0.1763   0.1968   0.2151   0.2308   0.2459
 .      0.2593   0.3540   0.4166   0.4627   0.4797   0.5305   0.5569   0.5790   0.6000

k = 13
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0064   0.0125   0.0136   0.0244   0.0301   0.0355   0.0400   0.0459   0.0510
 .0     0.0557   0.0975   0.1312   0.1575   0.1010   0.2006   0.2100   0.2343   0.2487
 .      0.2620   0.3561   0.416?   0.4629   0.5000   0.5305   0.5569   0.5793   0.6000

k = 14
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0069   0.0135   0.0199   0.0260   0.0320   0.037?   0.0432   0.0486   0.0538
 .0     0.0593   0.1015   0.1349   0.1618   0.1843   0.2038   0.2212   0.2366   0.2509
 .      0.2636   0.3565   0.4172   0.4632   0.5000   0.5307   0.5569   0.5790   0.6000

k = 15
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0073   0.0144   0.0211   0.0276   0.0339   0.0390   0.0456   0.0511   0.0565
 .0     0.0619   0.1047   0.1379   0.1646   0.1870   0.2063   0.2236   0.2390   0.2526
 .      0.2652   0.3573   0.4176   0.4632   0.5000   0.5307   0.5569   0.5793   0.6000

Table 5.2

Unnormalized throughput t
(to get the real values, multiply by D/T)
(λ is formed by the row prefix followed by the column digit, e.g. row .0, column 3 gives λ = 0.03)

k = 3
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0002   0.0005   0.0007   0.0010   0.0012   0.0015   0.0017   0.0020   0.0022
 .0     0.0024   0.0048   0.0070   0.0091   0.0112   0.0131   0.0150   0.0167   0.0184
 .      0.0199   0.0321   0.0390   0.0429   0.0448   0.0455   0.0454   0.0448   0.0439

k = 4
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0002   0.0004   0.0006   0.0008   0.0010   0.0012   0.0014   0.0015   0.0017
 .0     0.0019   0.0037   0.0053   0.0068   0.0082   0.0094   0.0106   0.0117   0.0126
 .      0.0135   0.0189   0.0207   0.0208   0.0202   0.0192   0.0181   0.0170   0.0159

k = 5
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0002   0.0003   0.0005   0.0007   0.0008   0.0010   0.0011   0.0013   0.0014
 .0     0.0016   0.0029   0.0042   0.0052   0.0061   0.0069   0.0076   0.0082   0.0088
 .      0.0092   0.0112   0.0110   0.0102   0.0093   0.0083   0.0075   0.0067   0.0060

k = 6
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0001   0.0003   0.0004   0.0006   0.0007   0.0008   0.0009   0.0011   0.0012
 .0     0.0013   0.0024   0.0033   0.0040   0.0046   0.0051   0.0055   0.0059   0.0061
 .      0.0063   0.0067   0.0060   0.0051   0.0044   0.0037   0.0032   0.0027   0.0023

k = 7
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0001   0.0002   0.0004   0.0005   0.0006   0.0007   0.0008   0.0009   0.0010
 .0     0.0011   0.0020   0.0026   0.0031   0.0035   0.0038   0.0040   0.0042   0.0043
 .      0.0043   0.0040   0.0033   0.0026   0.0021   0.0017   0.0014   0.0011   0.0009

k = 8
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0001   0.0002   0.0003   0.0004   0.0003   0.0006   0.0007   0.0008   0.0009
 .0     0.0009   0.0016   0.0021   0.0025   0.0027   0.0020   0.0029   0.0030   0.0030
 .      0.0030   0.0025   0.0013   0.0014   0.0010   0.0008   0.0006   0.0005   0.0004

k = 9
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0001   0.0002   0.0003   0.0004   0.0005   0.0005   0.0006   0.0007   0.0008
 .0     0.0008   0.0014   0.0017   0.0019   0.0021   0.0021   0.0021   0.0021   0.0021
 .      0.0021   0.0015   0.0010   0.0007   0.0005   0.0004   0.0003   0.0002   0.0001

k = 10
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0001   0.0002   0.0003   0.0003   0.0004   0.0005   0.0005   0.0006   0.0007
 .0     0.0007   0.0011   0.0014   0.0015   0.0016   0.0016   0.0016   0.0016   0.0015
 .      0.0015   0.0009   0.0006   0.0004   0.0002   0.0002   0.0001   0.0001   0.0001

k = 11
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0001   0.0002   0.0002   0.0003   0.0004   0.0004   0.0005   0.0005   0.0006
 .0     0.0006   0.0010   0.0011   0.0012   0.0012   0.0012   0.0012   0.0011   0.0011
 .      0.0010   0.0006   0.0003   0.0002   0.0001   0.0001   0.0001   0.0000   0.0000

k = 12
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0001   0.0001   0.0002   0.0003   0.0003   0.0004   0.0004   0.0005   0.0005
 .0     0.0005   0.0008   0.0009   0.0009   0.0009   0.0009   0.0009   0.0008   0.0008
 .      0.0007   0.0004   0.0002   0.0001   0.0001   0.0000   0.0000   0.0000   0.0000

k = 13
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0001   0.0001   0.0002   0.0002   0.0003   0.0003   0.0004   0.0004   0.0004
 .0     0.0005   0.0007   0.0007   0.0007   0.0007   0.0007   0.0006   0.0006   0.0006
 .      0.0005   0.0002   0.0001   0.0001   0.0000   0.0000   0.0000   0.0000   0.0000

k = 14
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0001   0.0001   0.0002   0.0002   0.0003   0.0003   0.0003   0.0004   0.0004
 .0     0.0004   0.0006   0.0006   0.0006   0.0006   0.0005   0.0005   0.0004   0.0004
 .      0.0004   0.0001   0.0001   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000

k = 15
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0001   0.0001   0.0002   0.0002   0.0002   0.0003   0.0003   0.0003   0.0004
 .0     0.0004   0.0005   0.0005   0.0005   0.0004   0.0004   0.0004   0.0003   0.0003
 .      0.0003   0.0001   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000

Table 5.3

Normalized throughput t' = (k+1)t
(to get the real values, multiply by D/T)
(λ is formed by the row prefix followed by the column digit, e.g. row .0, column 3 gives λ = 0.03)

k = 3
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0010   0.0020   0.0030   0.0040   0.0049   0.0059   0.0069   0.0079   0.0088
 .0     0.0098   0.0191   0.0280   0.0365   0.0446   0.0524   0.0598   0.0660   0.0735
 .      0.0798   0.1282   0.1562   0.1718   0.1792   0.1810   0.1817   0.1793   0.1758

k = 4
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0010   0.0020   0.0030   0.0039   0.0049   0.0059   0.0060   0.0077   0.0087
 .0     0.0096   0.0185   0.0266   0.0341   0.0409   0.0472   0.0531   0.0504   0.0631
 .      0.0676   0.0945   0.1033   0.1039   0.1008   0.0960   0.0905   0.0350   0.0797

k = 5
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0010   0.0020   0.002?   0.0039   0.0048   0.0050   0.0067   0.0076   0.0085
 .0     0.0094   0.0176   0.0249   0.0313   0.0368   0.0416   0.0458   0.0494   0.0526
 .      0.0553   0.0672   0.0663   0.0614   0.0556   0.0500   0.0440   0.0401   0.0359

k = 6
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0010   0.0020   0.0029   0.0039   0.0048   0.0057   0.0066   0.0074   0.0083
 .0     0.0091   0.0167   0.0230   0.0282   0.0325   0.0359   0.0388   0.0410   0.0428
 .      0.0442   0.0469   0.0419   0.0360   0.0306   0.0260   0.0221   0.018?   0.0162

k = 7
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0010   0.0020   0.002?   0.0038   0.0047   0.0056        ?   0.0073   0.0081
 .0     0.0089   0.0158   0.0211   0.0251   0.0282   0.0305   0.0322   0.0335   0.0342
 .      0.0348   0.0323   0.0263   0.0210   0.0160   0.0135   0.0109   0.0003   0.0072

k = 8
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0010   0.0019   0.0029   0.0038   0.0046   0.0055   0.0063   0.0070   0.0078
 .0     0.0085   0.0147   0.0191   0.0221   0.0242   0.0255   0.0264   0.0268   0.02?1
 .      0.0270   0.0221   0.0165   0.0123   0.0092   0.0069   0.0053   0.0041   0.0032

k = 9
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0010   0.0019   0.0028   0.0037   0.0045   0.0053   0.0061   0.0068   0.0075
 .0     0.0082   0.0136   0.0171   0.0192   0.0205   0.0213   0.0215   0.0215   0.0212
 .      0.0200   0.0152   0.0104   0.0072   0.0050   0.0036   0.0026   0.0019   0.0014

k = 10
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0010   0.0019   0.0028   0.0036   0.0044   0.0052   0.0057   0.0066   0.0072
 .0     0.0078   0.0125   0.0152   0.0166   0.0173   0.0175   0.0174   0.0171   0.0166
 .      0.0160   0.0103   0.0063   0.0042   0.0027   0.0018   0.0013   0.0009   0.0006

k = 11
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0010   0.0019   0.0027   0.0036   0.0043   0.0050   0.0057   0.0063   0.0069
 .0     0.0075   0.0114   0.0134   0.0143   0.0145   0.0143   0.0140   0.0135   0.0129
 .      0.0123   0.0071   0.0040   0.0024   0.0015   0.0009   0.0006   0.0004   0.0003

k = 12
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0010   0.0019   0.0027   0.0035   0.0042   0.0049   0.0055   0.0061   0.0066
 .0     0.0071   0.0104   0.0118   0.0122   0.0122   0.0117   0.0112   0.0107   0.0100
 .      0.0094   0.0048   0.0025   0.0014   0.0008   0.0005   0.0003   0.0002   0.0001

k = 13
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0010   0.0018   0.0026   0.0034   0.0041   0.0047   0.0053   0.0058   0.0063
 .0     0.0067   0.0094   0.0103   0.0104   0.0101   0.0096   0.0089   0.0084   0.0078
 .      0.0072   0.0033   0.0016   0.0008   0.0004   0.0002   0.0001   0.0001   0.0001

k = 14
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0010   0.0018   0.0026   0.0033   0.0039   0.0045   0.0050   0.0055   0.0059
 .0     0.0063   0.0085   0.0090   0.0088   0.0084   0.0078   0.0072   0.0066   0.0060
 .      0.0035   0.0022   0.0010   0.0005   0.0002   0.0001   0.0001   0.0000   0.0000

k = 15
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0009   0.0010   0.0025   0.0032   0.0038   0.0043   0.0048   0.0052   0.0056
 .0     0.0059   0.0077   0.0079   0.0075   0.0070   0.0063   0.0057   0.0052   0.0047
 .      0.0042   0.0015   0.0006   0.0003   0.0001   0.0001   0.0000   0.0000   0.0000
Table 5.4

Abort rate a
(to get the real values, multiply by D/T)
(λ is formed by the row prefix followed by the column digit, e.g. row .0, column 3 gives λ = 0.03)

k = 3
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0001   0.0001   0.0001
 .0     0.0001   0.0004   0.0010   0.0010   0.0027   0.0037   0.0053   0.0069   0.0006
 .      0.0106   0.0373   0.0820   0.1346   0.194?   0.2610   0.3312   0.4051   0.4017

k = 4
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0000   0.0000   0.0000   0.0000   0.0000   0.0001   0.0001   0.0001   0.0001
 .0     0.0007   0.0006   0.0014   0.0025   0.0038   0.0054   0.0072   0.0093   0.0117
 .      0.0142   0.0501   0.0775   0.1577   0.2221   0.2711   0.3637   0.4390   0.5163

k = 5
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0000   0.0000   0.0000   0.0000   0.0001   0.0001   0.0001   0.0001   0.0002
 .0     0.0002   0.0008   0.0010   0.0031   0.0048   0.0067   0.0070   0.0115   0.0143
 .      0.0173   0.0574   0.1079   0.1702   0.2350   0.3053   0.3777   0.4530   0.5301

k = 6
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0000   0.0000   0.0000   0.0000   0.0001   0.0001   0.0001   0.0002   0.0002
 .0     0.0003   0.0010   0.0021   0.0037   0.0056   0.0077   0.0104   0.0133   0.0163
 .      0.0177   0.0623   0.1161   0.176?   0.2426   0.311?   0.3842   0.4590   0.5359

k = 7
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0000   0.0000   0.0000   0.0000   0.0001   0.0001   0.0001   0.0002   0.0002
 .0     0.0003   0.0011   0.0025   0.0042   0.0064   0.008?   0.0116   0.0147   0.0180
 .      0.0215   0.0655   0.119?   0.1807   0.2462   0.3152   0.3872   0.4617   0.5382

k = 8
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0000   0.0000   0.0000   0.0001   0.0001   0.0001   0.0002   0.0002   0.0003
 .0     0.0003   0.0013   0.0020   0.0047   0.0070   0.0077   0.0126   0.0150   0.0192
 .      0.0229   0.0676   0.1220   0.1027   0.2400   0.3170   0.3886   0.462?   0.5392

k = 9
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0000   0.0000   0.0000   0.0001   0.0001   0.0001   0.0002   0.0002   0.0003
 .0     0.0004   0.0014   0.0030   0.0051   0.0076   0.0103   0.0134   0.0167   0.0202
 .      0.0240   0.0609   0.1233   0.1030   0.2489   0.3176   0.3893   0.4633   0.5396

k = 10
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0000   0.0000   0.0000   0.0001   0.0001   0.0002   0.0002   0.0003   0.0003
 .0     0.0004   0.0016   0.0033   0.0055   0.0080   0.0108   0.0140   0.0174   0.0209
 .      0.0247   0.0699   0.1241   0.1344   0.2494   0.3130   0.3895   0.4637   0.5398

k = 11
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0000   0.0000   0.0000   0.0001   0.0001   0.0002   0.0002   0.0003   0.0004
 .0     0.0005   0.0017   0.0035   0.0058   0.0084   0.0113   0.0145   0.0179   0.0215
 .      0.0253   0.0704   0.1246   0.1849   0.2497   0.3181   0.3897   0.4637   0.5400

k = 12
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0000   0.0000   0.0000   0.0001   0.0001   0.0002   0.0003   0.0003   0.0004
 .0     0.0005   0.0010   0.0037   0.0060   0.0086   0.0116   0.0149   0.0183   0.0219
 .      0.0257   0.0700   0.124?   0.1850   0.2498   0.3183   0.3873   0.4639   0.5400

k = 13
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0000   0.0000   0.0001   0.0001   0.0001   0.0002   0.0003   0.0003   0.0004
 .0     0.0005   0.0019   0.0038   0.0067   0.0009   0.0119   0.0157   0.0186   0.0222
 .      0.0261   0.0711   0.1250   0.1852   0.2500   0.3183   0.3890   0.463?   0.5400

k = 14
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0000   0.0000   0.0001   0.0001   0.0002   0.0002   0.0003   0.0004   0.0005
 .0     0.0006   0.0020   0.0040   0.0064   0.0091   0.0121   0.0154   0.0180   0.0225
 .      0.0263   0.0712   0.1251   0.1053   0.2500   0.3134   0.3899   0.463?   0.5400

k = 15
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0000   0.0000   0.0001   0.0001   0.0002   0.0002   0.0003   0.0004   0.0005
 .0     0.0006   0.0070   0.0041   0.0065   0.0093   0.0123   0.0156   0.0190   0.0227
 .      0.0265   0.0714   0.1253   0.1853   0.2500   0.3184   0.3899   0.4639   0.5400

Table 5.5

The ratio r of abort rate to throughput
= the number of aborts per transaction
(λ is formed by the row prefix followed by the column digit, e.g. row .0, column 3 gives λ = 0.03)

k = 3
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0045   0.0090   0.0136   0.0101   0.0227   0.0272   0.0310   0.0364   0.0410
 .0     0.0457   0.0927   0.1412   0.1944   0.2458   0.2986   0.3529   0.4125   0.4699
 .      0.5320   1.2276   2.0994   3.1345   4.3502   5.7406   7.2915   9.0358   10.961

k = 4
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0000   0.0161   0.0242   0.0324   0.0406   0.0409   0.0572   0.0656   0.0740
 .0     0.0874   0.1699   0.2625   0.3605   0.4641   0.5735   0.6030   0.7902   0.9257
 .      1.0529   2.6470   4.0160   7.5904   11.02?   15.160   20.090   25.834   32.399

k = 5
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0120   0.0253   0.0301   0.0510   0.0641   0.0773   0.0901   0.1041   0.1171
 .0     0.1314   0.2763   0.4290   0.5959   0.7701   0.9680   1.1738   1.3964   1.6261
 .      1.0730   5.1275   9.9526   16.619   25.433   36.640   50.653   67.819   88.541

k = 6
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0101   0.0365   0.0552   0.0742   0.0934   0.1123   0.1315   0.151?   0.1713
 .0     0.1941   0.4105   0.6496   0.9215   1.2066   1.0390   1.0030   2.2650   2.6705
 .      3.1172   9.2900   19.400   34.395   55.520   34.033   121.64   170.45   232.21

k = 7
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0240   0.0500   0.0751   0.1015   0.1276   0.1550   0.1822   0.2100   0.2383
 .0     0.2637   0.5745   0.9363   1.3526   1.8092   2.3201   2.0863   3.5077   4.2124
 .      4.9447   16.246   36.457   60.790   117.46   107.06   284.63   417.75   594.13

k = 8
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0325   0.4650   0.0973   0.1336   0.1609   0.2052   0.2424   0.279?   0.3168
 .0     0.3581   0.7968   1.3045   1.9132   2.6087   3.4158   4.3074   5.3126   6.3879
 .      7.6202   27.493   66.492   133.05   242.96   411.66   655.04   1011.2   1505.4

k = 9
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0412   0.0030   0.1263   0.1711   0.2174   0.2643   0.3120   0.3617   0.4123
 .0     0.4607   1.0495   1.7731   2.6580   3.6700   4.8562   6.2307   7.7792   9.5313
 .     11.1502   45.406   118.04   255.56   495.09   886.29   1500.7   2421.4   3786.3

k = 10
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0511   0.1035   0.1503   0.2142   0.2726   0.3322   0.4633   0.4576   0.5235
 .0     0.5905   1.3761   2.3646   3.6053   5.0893   6.7900   0.0401   11.100   13.883
 .      16.901  74.1513   210.23   406.66   1002.7   1903.2   3398.0   5805.9   9497.6

k = 11
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0622   0.1267   0.1934   0.2623   0.3348   0.4096   0.4301   0.5674   0.6521
 .0     0.7465   1.7607   3.1055   4.0391   6.9210   9.4695   12.309   15.888   20.022
 .      24.635   119.50   369.20   920.31   2024.5   4072.1   7703.2    13813    23840

k = 12
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0744   0.1512   0.2329   0.3160   0.4060   0.4989   0.5955   0.695?   0.8020
 .0     0.9012   2.2427   4.0247   6.3623   9.2476   12.068   17.202   22.298   20.547
 .      5.6442   191.30   641.37   1724.6   4070.4   0719.7    17472    33030    59603

k = 13
 λ         1        2        3        4        5        6        7        8        9
 .00    0.0365   0.1783   0.2772   0.3703   0.4372   0.5997   0.7179   0.8418   0.9739
 .0     1.1069   2.7933   5.2227   8.2367   12.406   17.300   23.758   31.155   40.144
 .      50.905   304.65   1109.2   3232.9   0191.0    10574    39436    78613   ?49010

k = 14
 λ         1        2        3        4        5        6        7        8        9
 .00    0.1011   0.2099   0.3249   0.4461   0.5776   0.7131   0.8569   1.0092   1.1701
 .0     1.3522   3.4764   6.6104   10.029   16.332   23.313   32.106   42.032   56.113
 .      71.543   477.97   1910.7   6066.0    16303    39825    8900?   ??7100   372528

k = 15
 λ         1        2        3        4        5        6        7        8        9
 .00    0.1169   0.2429   0.3779   0.5221   0.6770   0.8403   1.0145   1.1978   1.3932
 .0     1.6083   4.2577   0.2655   13.039   21.314   31.030   43.538   59.107   77.057
 .      100.02   757.02   3322.1    11301    32767    84868   200895   445300   93?321

Table 5.6

s = (number of locks an average transaction holds)/k
(λ is formed by the row prefix followed by the column digit, e.g. row .0, column 3 gives λ = 0.03)

k = 3
 λ         1        2        3        4        5        6        7        8        9
 .00    0.4993   0.4985   0.4978   0.4970   0.4963   0.4955   0.4948   0.4941   0.4933
 .0     0.4926   0.4854   0.4785   0.4791   0.4709   0.4634   0.4564   0.4531   0.4463
 .      0.4423   0.3905   0.3490   0.3141   0.2355   0.2615   0.2409   0.2235   0.2084

k = 4
 λ         1        2        3        4        5        6        7        8        9
 .00    0.4990   0.4980   0.4970   0.4960   0.4950   0.4941   0.4931   0.4921   0.4912
 .0     0.4902   0.4808   0.4717   0.4630   0.4545   0.4464   0.4358   0.4264   0.4197
 .      0.4114   0.3455   0.2967   0.2599   0.2315   0.2089   0.1905   0.1752   0.1622

k = 5
 λ         1        2        3        4        5        6        7        8        9
 .00    0.4988   0.4975   0.4963   0.4950   0.4938   0.4926   0.4886   0.490?   0.4869
 .0     0.4878   0.4762   0.4593   0.4463   0.4349   0.4221   0.4110   0.4009   0.390?
 .      0.3?06   0.3041   0.2536   0.2183   0.1922   0.1720   0.1559   0.1427   0.1318

k = 6
 λ         1        2        3        4        5        6        7        8        9
 .00    0.4985   0.4970   0.4955   0.4941   0.4926   0.4885   0.4351   0.4843   0.4816
 .0     0.4854   0.4643   0.4446   0.4297   0.4119   0.3995   0.3852   0.3729   0.3608
 .      0.3502   0.2604   0.2195   0.1867   0.1632   0.1453   0.1313   0.1199   0.1105

k = 7
 λ         1        2        3        4        5        6        7        8        9
 .00    0.4903   0.4965   0.4901   0.4896   0.4859   0.4851   0.4822   0.4796   0.4773
 .0     0.4697   0.4485   0.4290   0.4109   0.3920   0.3751   0.3597   0.3456   0.3335
 .      0.3212   0.2387   0.1924   0.1624   0.1413   0.1254   0.1131   0.1032   0.0950

k = 8
 λ         1        2        3        4        5        6        7        8        9
 .00    0.4900   0.4960   0.4900   0.4361   0.4830   0.4803   0.477?   0.4742   0.4697
 .0     0.4692   0.4414   0.4129   0.3910   0.3705   0.3530   0.3363   0.3214   0.3072
 .      0.2951   0.?138   0.1706   0.1432   0.1242   0.1102   0.0992   0.0905   0.0832

k = 9
 λ         1        2        3        4        5        6        7        8        9
 .00    0.4970   0.4901   0.4061   0.4031   0.4805   0.4763   0.4723   0.4634   0.4646
 .0     0.4581   0.4250   0.3968   0.3728   0.3501   0.3302   0.3134   0.2979   0.2042
 .      0.2719   0.1929   0.1528   0.1279   0.1107   0.0981   0.0383   0.0805   0.0740

k = 10
 λ         1        2        3        4        5        6        7        8        9
 .00    0.4975   0.4901   0.4861   0.4806   0.4764   0.4713   0.5336   0.4622   0.4581
 .0     0.4535   0.4145   0.3009   0.3541   0.3306   0.3094   0.2921   0.2765   0.2629
 .      0.2509   0.1755   0.1382   0.1154   0.0998   0.0333   0.0795   0.0725   0.0666

k = 11
 λ         1        2        3        4        5        6        7        8        9
 .00    0.4973   0.4901   0.4331   0.4762   0.4711   0.4655   0.4610   0.4549   0.4507
 .0     0.4494   0.4019   0.3651   0.3368   0.3118   0.2913   0.2729   0.2575   0.2443
 .      0.2322   0.1605   0.1260   0.1051   0.0908   0.0303   0.0723   0.0639   0.0606

k = 12
 λ         1        2        3        4        5        6        7        8        9
 .00    0.4970   0.4059   0.4804   0.4724   0.4666   0.4606   0.4546   0.4486   0.4434
 .0     0.4344   0.3091   0.3497   0.3193   0.2938   0.2733   0.2560   0.2404   0.2276
 .      0.2160   0.1470   0.1157   0.0964   0.0833   0.0737   0.0663   0.0604   0.0556

k = 13
 λ         1        2        3        4        5        6        7        8        9
 .00    0.4092   0.4023   0.4781   0.4690   0.4626   0.4550   0.4480   0.4413   0.4356
 .0     0.4286   0.3749   0.3364   0.3030   0.2705   0.2572   0.2404   0.2253   0.2126
 .      0.2015   0.1370   0.1069   0.0890   0.0769   0.0680   0.0612   0.0550   0.0513

k = 14
 λ         1        2        3        4        5        6        7        8        9
 .00    0.4095   0.4027   0.4737   0.4644   0.4577   0.4491   0.4413   0.4341   0.4273
 .0     0.4233   0.3626   0.3713   0.2889   0.2633   0.2426   0.2257   0.2113   0.1992
 .      0.1003   0.1273   0.0993   0.0827   0.0714   0.0632   0.0568   0.0?10   0.0476

k = 15
 λ         1        2        3        4        5        6        7        8        9
 .00    0.4097   0.4797   0.4699   0.4603   0.4571   0.4427   0.4345   0.4762   0.4?66
 .0     0.417?   0.3491   0.3065   0.2743   0.2493   0.2293   0.2130   0.1991   0.1071
 .      0.1760   0.1191   0.0920   0.0772   0.0667   0.0590   0.0530   0.0403   0.0444


Table  5.7 


Comparison  of  values  of  performance  measures 
as  predicted  by  the  model  with  simulation  results  (D=100) 


Probability  of  conflict 


predicted  value 
simulation  result 


predicted  value 
Simulation  result 


predicted  value 
simulation  result 


predicted  value 
simulation  result 


Abort  rate  a 


0.133 
0. 133 


0.165 
0.162 


0.314 

0.320 


0.356 

0.356 


0.395 

0.397 


0.428 

0.425 


0.453 

0.459 


0.506 
0.501 


0.533 

0.527 


0.563 


0.584 

0.586 


0. 585' 


0.551 

0.597 

0.551 

0.588 

predicted  value 
simulation  result 


predicted  value 
simulation  result 


)6  0.273 

)2  0.271 

0.390 

0.384 

0.473 

0.469 

L  0.3 

0.5 

0.7 

0.332 

0.444 

0.328 

0.446 

Table 5.7 (continued)

k = 5
 λ                       0.1      0.3      0.5      0.7      0.9
 predicted value       0.173    0.366    0.472    0.540    0.589
 simulation result     0.161    0.361    0.469    0.536    0.580

• ,  ’•  , 

*  6 

r 

0.1 

0.3 

0.5 

0.7 

0.5 

Throughput


predicted  value 
simulation  result 


predicted  value 
simulation  result 


predicted  value 
simulation  result 


0.135 

0.141 


0.069 

0.071 


0.092 

0.097 


0.037 

0.038 


0.020 

0.020 


0.040 

0.040 


0.019 

0.018 


0.009 

0.010 


0.026 

0.027 


0.011 

0.011 


0 . 01d 
0.018 


0.007 

O.OC7 


Table 5.8

 k                        3       4       5       6       7       8       9
 CC-thrashing point     0.645   0.358   0.249   0.175   0.120   0.095   0.074
 r                        6.4     6.4     7.3     7.5     7.2     7.0     6.9
 s                       0.25    0.27    0.28    0.29    0.29    0.30    0.31

 k                        10      11      12      13      14      15
 CC-thrashing point     0.060   0.049   0.045   0.037   0.031   0.026
 r                        6.6
 s                        0.3
1.  Introduction

A database system (DBS) processes read and write commands issued by
users' transactions to access the database.  If a transaction fails in
midstream, or if the system fails, the database may be left in an
incorrect state.  For example, if a money transfer transaction fails
after posting its debit but before posting its corresponding credit,
then the accounts are left unbalanced.  The recovery algorithm of a DBS
avoids these incorrect states by ensuring that the database only
includes updates that are produced by transactions that execute to
completion.  This paper is a survey of recovery algorithms for centralized
and distributed DBS's.

Computer systems can fail in many ways, only some of which are
handled by DBS recovery algorithms.  We limit our attention to clean
failures in which a transaction, the system, or, in the case of a
distributed DBS, one site of the system simply stops running.  We do not
consider traitorous failures in which components continue to run but
perform incorrect actions (see [Do, FSL]).  We further limit attention
to soft failures in which the contents of main memory are lost, but the
contents of secondary memory (disk) remain intact.  We do not consider
methods for recovering from disk failures, although methods similar to
those in this paper apply (see [Qr, QMBLL, HB2, Li, Lo, Pel]).

We describe a model of centralized DBS recovery in Section 2.  We
present four canonical types of centralized DBS recovery algorithms in
Sections 3 through 6.  We describe recovery algorithms for distributed
DBS's in Section 7.


2.  A Model of Centralized Database System Recovery


We model a centralized database system as a scheduler, a recovery
system, and storage.


[Figure: system model.  Arrow labels: "Read/Write/Commit/Abort operations"
and "Read/Write/Commit/Abort/Restart operations".]


Storage 

The storage component consists of buffer storage and stable
storage.  Both are divided into physical pages of equal and fixed size.
Buffer storage models main memory.  Buffer storage is relatively fast,
but of limited capacity, and it doesn't survive system crashes.  Stable
storage models disk memory and it is relatively slow, of (almost)
unlimited capacity, and it does survive crashes.

The database consists of a set of logical pages.  We assume that
one physical copy (usually the most up-to-date copy) of each logical
page is stored in a portion of stable storage called the stable
database.  Other portions of stable storage may be used by the recovery
system as nonvolatile scratch space in ways that will be described later.
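The distinction between buffer storage and stable storage can be illustrated with a toy model. This is only a sketch under simplifying assumptions: pages are dictionary entries, a crash is a method call, and the class and method names are hypothetical rather than taken from the paper.

```python
class Storage:
    """Toy model of the storage component: volatile buffer plus stable storage."""

    def __init__(self):
        self.buffer = {}  # page id -> contents; lost on a crash
        self.stable = {}  # page id -> contents; survives a crash

    def write(self, page_id, contents):
        """Write a page into buffer storage only."""
        self.buffer[page_id] = contents

    def flush(self, page_id):
        """Copy one buffered page to stable storage."""
        self.stable[page_id] = self.buffer[page_id]

    def read(self, page_id):
        """Prefer the buffered copy, fall back to the stable copy."""
        if page_id in self.buffer:
            return self.buffer[page_id]
        return self.stable.get(page_id)

    def crash(self):
        """A soft failure: buffer contents vanish, stable storage remains."""
        self.buffer.clear()


if __name__ == "__main__":
    storage = Storage()
    storage.write("P1", "flushed before the crash")
    storage.flush("P1")
    storage.write("P2", "never flushed")
    storage.crash()
    print(storage.read("P1"), storage.read("P2"))
```

After the simulated crash only the flushed page is still readable, which is the situation the recovery system has to cope with.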


Transactions

A transaction is a program that can read from or write into the
database.  A transaction can issue four types of commands: Read, Write,
Commit, and Abort.  Read causes a page to be read from the database.
Write causes a new copy of a logical page to be written into the
database.  Commit tells the system that the transaction has terminated and
that all of its updated pages should be permanently reflected in the
database.  Abort tells the system that the transaction has terminated
abnormally and that the pages it wrote into should be returned to their
previous state.  (Commit and Abort may be issued by a process
controlling the transaction, rather than by the transaction itself.)  A
transaction can have only one Commit or Abort processed.

A  transaction  is  active  if  it  has  begun  executing  but  has  not  yet 
had  its  Commit  or  Abort  processed. 

Notation: Each command is subscripted by the transaction that
issued it. For example, Readi(Pj) is a Read issued by transaction Ti on
page Pj.

The Scheduler

The scheduler controls the order in which Reads, Writes, Commits,
and Aborts are passed to the recovery system. Although the scheduler
allows commands from different transactions to be interleaved, it
guarantees that the resulting execution is serializable. An execution
is serializable if the effect is exactly the same as if the transactions
had been executed serially, one after the next, with no concurrency at
all. Many scheduling algorithms for attaining serializability are known
[BG1, 2]; versions of all of them are compatible with the recovery
algorithms described in this paper.

The scheduler also guarantees that the execution is recoverable.
An execution is recoverable if each transaction Ti is not committed
until, for each page read by Ti, the transaction that last wrote that
page is committed.

Recoverability is needed to avoid errors such as the following.
Suppose Ti reads a page Pj last written by Tj (which is still active),
Ti writes another page Pk, and commits. Now, suppose Tj fails and is
aborted. Aborting Tj causes its write on Pj to be undone, thereby
rendering Ti's input invalid. But, since Ti cannot be aborted after
having been committed, Ti's updates to Pk must remain in the database
even though its input is invalid.
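
The recoverability condition can be checked mechanically on a recorded
execution. The following is a minimal sketch, not part of the original
model: the tuple encoding of a history and the name is_recoverable are
assumptions made for illustration, and the sketch ignores the effect of
aborts on who "last wrote" a page.

```python
def is_recoverable(history):
    """Check the recoverability condition on a history.

    history is a list of (op, txn, page) tuples, where op is "r" (Read),
    "w" (Write), "c" (Commit) or "a" (Abort); page is None for c and a.
    This encoding is an assumption of the sketch.
    """
    last_writer = {}   # page -> transaction that last wrote it
    reads_from = {}    # txn -> set of transactions whose writes it read
    committed = set()
    for op, txn, page in history:
        if op == "w":
            last_writer[page] = txn
        elif op == "r":
            writer = last_writer.get(page)
            if writer is not None and writer != txn:
                reads_from.setdefault(txn, set()).add(writer)
        elif op == "c":
            # txn may commit only if everyone it read from has committed.
            if not reads_from.get(txn, set()) <= committed:
                return False
            committed.add(txn)
    return True


# The erroneous execution described above: Ti commits before Tj.
bad = [("w", "Tj", "Pj"), ("r", "Ti", "Pj"), ("w", "Ti", "Pk"),
       ("c", "Ti", None), ("a", "Tj", None)]
# The same reads and writes, but Tj commits first.
good = [("w", "Tj", "Pj"), ("r", "Ti", "Pj"), ("w", "Ti", "Pk"),
        ("c", "Tj", None), ("c", "Ti", None)]
```

Here is_recoverable(bad) is False and is_recoverable(good) is True.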

For definiteness, we assume that the scheduler uses page-level
two-phase locking (2PL) [EGLT]. Before outputting Readi(Pj) (resp.
Writei(Pj)), the scheduler sets a read lock (resp. write lock) on page
Pj for transaction Ti. Two transactions cannot concurrently own
conflicting locks on the same page, where read locks conflict with
write locks and write locks conflict with read and write locks. If the
scheduler receives an operation for which it can't set the
corresponding lock, it delays the operation until the lock can be set.

When the scheduler receives a Commiti or an Aborti, it forwards the
operation directly to the recovery system. When the recovery system
acknowledges that the operation has been processed, the scheduler then
releases all the locks held by Ti.

Two-phase locking ensures serializability (see [BG2, EGLT] for
proofs). The version of 2PL presented above also ensures recoverability
by requiring that a transaction hold its write locks until its Commit or
Abort is processed.
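
The lock conflict rule and the release of locks after Commit or Abort
can be sketched as follows. This is an illustrative sketch only; the
class LockTable and its method names are assumptions, and a real
scheduler would queue a delayed operation rather than return False.

```python
class LockTable:
    """Page-level read/write locks with the conflict rule of 2PL."""

    def __init__(self):
        self.held = {}  # page -> {txn: "read" or "write"}

    def try_lock(self, txn, page, mode):
        """Set the lock if no other transaction holds a conflicting one.

        Returns False when the operation would have to be delayed.
        """
        holders = self.held.setdefault(page, {})
        for other, other_mode in holders.items():
            # Two locks conflict if at least one of them is a write lock.
            if other != txn and "write" in (mode, other_mode):
                return False
        if holders.get(txn) != "write":   # never downgrade a write lock
            holders[txn] = mode
        return True

    def release_all(self, txn):
        """Called once the recovery system acknowledges Commit or Abort."""
        for holders in self.held.values():
            holders.pop(txn, None)
```

Because release_all is called only after the Commit or Abort has been
processed, write locks are held to the end of the transaction, which is
the property the text relies on for recoverability.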

The Recovery System

The  recovery  system  processes  the  Read,  Write,  Commit,  and  Abort 
commands  it  receives  from  the  scheduler.  It  also  handles  system 
failures. 

A system failure can interrupt the DBS at any moment. It causes
all processing to stop and the contents of buffer storage to be lost.
After the system recovers, transactions that were active at the time of
the failure cannot continue executing because the contents of main
memory are now useless. Thus, after the failure and before processing
any other commands, the recovery system processes the Restart command,
whose effect is to abort all active transactions.

To handle failures properly, it is essential that the Commit
command be implemented in a single instruction, normally a page write.
If it were to require more than one instruction, a system failure could
interrupt a partially completed Commit, making it ambiguous whether the
transaction should be aborted during restart. Said differently, each
transaction must always be in one of three states: active, committed, or
aborted, and each state change must be implemented by an atomic
instruction execution.

Structuring Scratch Space

There are several types of information that a recovery algorithm
stores in stable scratch space. It may store the identifiers of
transactions that have committed, called the commit list. In this case,
the single instruction that implements Commiti is usually a write that
adds Ti to the commit list. The recovery algorithm may also store a
list of identifiers of transactions that are active, called the active
list, and those that have aborted, called the abort list.

Recovery algorithms often store copies of pages that were recently
written on an audit trail (sometimes called a journal or log). For each
write processed by the recovery algorithm, the audit trail may contain
the identifier of the transaction that performed the write, a copy of
the newly written page (called an after-image), and a copy of the
physical page in the stable database that was overwritten by the write
(called a before-image). Different algorithms vary considerably in the
information they keep on the audit trail and in how they structure that
information.

Undo and Redo

Recovery algorithms also differ in the time at which they write
pages into the stable database. They may perform such writes before,
concurrently with, or after the atomic instruction that commits the
transaction that last wrote those pages.

Suppose that a page written by an active transaction is written
into the stable database before the transaction commits. If the
transaction aborts due to a system or transaction failure, the recovery
algorithm must undo the write by restoring the previous copy
(before-image) of the page.

Suppose  that  a  page  written  by  an  active  transaction  is  not  written 
into  the  stable  database  before  the  transaction  commits.  If  a  system 
failure  occurs  after  the  transaction  commits  but  before  the  page  is 
written  into  the  stable  database,  the  recovery  algorithm  must  redo  the 
write  by  moving  the  page  to  the  stable  database. 

In every recovery algorithm, the after-images produced by a
transaction must be written to stable storage (the database or scratch
space) before the transaction commits. This is called the commit rule.
If it is violated, a system failure shortly after a transaction Ti
commits could leave the recovery algorithm with no stable copy of Ti's
after-images, making it impossible to redo Ti.

Every recovery algorithm must also obey the log ahead rule: if an
after-image is written to the stable database before the transaction
that wrote it commits, then the before-image of that page must first be
written to the audit trail. Otherwise, a system failure could occur
after the after-image is in the stable database but before the
before-image is in the audit trail, in which case the write could not be
undone.
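
Both rules are ordering constraints on writes to stable storage, so
they can be checked against a trace of events. The sketch below is
illustrative only: the event names ("write", "log_before",
"log_after", "db_write", "commit") and the function rule_violations
are assumptions introduced here, not terminology from the algorithms
that follow.

```python
def rule_violations(events):
    """Return violations of the commit rule and the log ahead rule.

    events is an ordered list of tuples (hypothetical encoding):
      ("write", txn, page)       txn produced an after-image in a buffer
      ("log_before", txn, page)  before-image written to the audit trail
      ("log_after", txn, page)   after-image written to the audit trail
      ("db_write", txn, page)    after-image written to the stable database
      ("commit", txn)            the atomic commit instruction
    """
    pages_written = {}      # txn -> pages it wrote
    logged_before = set()   # (txn, page) with a stable before-image
    stable_after = set()    # (txn, page) with a stable after-image
    committed = set()
    violations = []
    for event in events:
        kind, txn = event[0], event[1]
        if kind == "write":
            pages_written.setdefault(txn, set()).add(event[2])
        elif kind == "log_before":
            logged_before.add((txn, event[2]))
        elif kind == "log_after":
            stable_after.add((txn, event[2]))
        elif kind == "db_write":
            # Log ahead rule: before-image must already be on the trail.
            if txn not in committed and (txn, event[2]) not in logged_before:
                violations.append(("log ahead", txn, event[2]))
            stable_after.add((txn, event[2]))
        elif kind == "commit":
            # Commit rule: every after-image must already be stable.
            for page in sorted(pages_written.get(txn, ())):
                if (txn, page) not in stable_after:
                    violations.append(("commit", txn, page))
            committed.add(txn)
    return violations
```

A trace that writes a page to the stable database without first logging
its before-image yields a "log ahead" violation; a trace that commits
while an after-image exists only in a buffer yields a "commit"
violation.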

Categorization of Recovery Algorithms

Recovery  algorithms  can  be  categorized  based  on  the  timing  of 
updates  to  the  stable  database.  There  are  four  types  of  recovery  algo¬ 
rithms:  ones  that  require  undo  but  not  redo,  redo  but  not  undo,  both 
undo  and  redo,  and  neither  undo  nor  redo.  These  types  of  algorithms  are 
described  in  Sections  3-6. 


3.  Algorithms  That  Undo  But  Don’t  Redo 

For  each  type  of  recovery  algorithm,  we  present  a  generic  algorithm 
based  on  our  database  system  model  and  then  we  list  example  implementa¬ 
tions.  We  describe  this  generic  version  by  explaining  how  each  command 
is processed. In all of the algorithms, the first command processed for
Ti should add Ti to the active list.

For each operation, we mark by "{Ack}" the point at which the
recovery  system  can  acknowledge  to  the  scheduler  that  the  operation  has 
been  completed.  Sometimes  the  operation  has  additional  work  to  do  after 
the  acknowledgement  is  sent. 

Readi(Pj). Copy Pj from the stable database into a buffer. {Ack}

Writei(Pj). Copy the before-image of Pj (from the stable database)
to the audit trail. {Ack} Then* (after the disk acknowledges the write
in the audit trail), write the new copy of Pj into the stable database.

* In every algorithm, we use "then" to mean "wait for the previous step
to complete before proceeding to the next step".

Commiti. Make sure all pages written by Ti are in the stable
database. Then write Ti into the commit list. {Ack} Then delete it from
the active list.

Aborti. Write Ti into the abort list. Then undo all of Ti's
writes by reading their before-images from the audit trail and writing
them back into the stable database. {Ack} Then, delete Ti from the
active list.

Restart. Process Aborti for each Ti on the active list. {Ack}

In this algorithm, all pages written by a transaction are written
into the stable database before the transaction commits. Thus, redo is
never needed, but an abort may require undo.
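
The generic undo/no-redo algorithm can be sketched in a few lines. This
is a simplified illustration under stated assumptions: the class name
UndoNoRedo is hypothetical, Python dictionaries and lists stand in for
the stable database and the audit trail, every write goes straight to
the (simulated) stable database, and acknowledgements are omitted.

```python
class UndoNoRedo:
    """Sketch of the generic undo/no-redo algorithm (no buffering)."""

    def __init__(self, stable_db):
        self.db = dict(stable_db)   # stable database: page -> contents
        self.audit = []             # audit trail: (txn, page, before_image)
        self.active = set()
        self.commit_list = set()
        self.abort_list = set()

    def write(self, txn, page, value):
        self.active.add(txn)
        # Log ahead rule: the before-image reaches the audit trail first.
        self.audit.append((txn, page, self.db.get(page)))
        self.db[page] = value

    def commit(self, txn):
        # All of txn's pages are already in the stable database.
        self.commit_list.add(txn)   # the atomic commit instruction
        self.active.discard(txn)    # only after the commit list write

    def abort(self, txn):
        self.abort_list.add(txn)
        # Undo in reverse order so the oldest before-image wins.
        for t, page, before in reversed(self.audit):
            if t == txn:
                if before is None:
                    self.db.pop(page, None)
                else:
                    self.db[page] = before
        self.active.discard(txn)

    def restart(self):
        for txn in list(self.active):
            self.abort(txn)
```

For example, if T1 overwrites P1 and T2 creates P2, and only T2 commits
before a failure, restart restores P1 from its before-image and leaves
P2 in place. Processing restart a second time changes nothing, which is
the resilience property required of Restart.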

It is actually not necessary to write an after-image into the
stable database immediately after the before-image is written into the
audit trail. The after-image could be left in buffer storage for
a while, provided it is written to the audit trail before the
transaction commits as required by the commit rule.

This algorithm obeys the log ahead rule: in processing Writei(Pj),
the before-image of Pj is written to the audit trail before the
after-image is written to the stable database.

The order in which writes are applied to stable storage is quite
sensitive in this (and most other) recovery algorithms. In this
algorithm, for example, in processing Commiti it is incorrect to delete
Ti from the active list before writing it into the commit list.

Remember that a system failure can occur during the processing of a
Restart. So Restart must also take care to write pages to stable
storage in such an order that it will be resilient to a system failure
(followed by another Restart).

After Commiti or Aborti has been processed, the audit trail
copies of pages written by Ti are no longer needed and can be returned
to free space. The algorithm for garbage collecting these audit trail
pages depends principally on the audit trail's data structure. We will
not discuss garbage collection issues for any of the recovery methods
described in this paper.


The Prime Algorithm


This type of recovery algorithm is used in a database system
product offered by Prime Computers [Du], and the Adaplex database system
being developed at CCA [CFLNR].

In Prime's algorithm, each page in the stable database has a
pointer to its before-image in the audit trail. Each before-image in
the audit trail points, in turn, to the next older before-image of the
same page. Also, each physical page carries the transaction identifier
of the transaction that wrote that particular copy. And, for each
active transaction there is a convenient way to obtain a list of all
pages it has written.

The  page  pointers  are  used  for  two  purposes.  First,  to  process  an 
Abort,  the  pointer  in  each  stable  database  page  makes  it  easy  to  undo 
the  aborted  transaction's  writes.  Second,  they  help  avoid  concurrency 
control  conflicts  between  queries  and  updates,  as  follows. 

A query is a read-only transaction. Reads issued by queries are
not locked in the scheduler but are passed directly to the recovery
system (without being delayed). When the recovery algorithm receives the
first read issued by a query Ti, say Readi(Pj), it reads the commit list
and then selects the newest copy on the chained list of Pj copies whose
transaction identifier is on the commit list. Subsequent reads by Ti
are processed in the same way, using the copy of the commit list that
was read when the first Readi was processed. By reading in this way,
queries see a consistent copy of the database, yet they do not set read
locks that might delay update transactions.
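
The way a query selects a page copy can be sketched as a walk down the
chain of copies. This is a hedged illustration, not Prime's actual data
layout: the representation of the chain as a newest-first list of
(transaction identifier, contents) pairs and the function name
read_for_query are assumptions.

```python
def read_for_query(chain, commit_snapshot):
    """Return the newest committed copy of a page for a query.

    chain is the page's list of copies, newest first, each a
    (txn_id, contents) pair: the stable database copy followed by its
    before-images. commit_snapshot is the copy of the commit list the
    query took when it issued its first Read.
    """
    for txn_id, contents in chain:
        if txn_id in commit_snapshot:
            return contents
    return None  # no copy of this page was committed at snapshot time


# Copies of Pj written by T3 (newest), T2 and T1 (oldest).
chain_pj = [("T3", "v3"), ("T2", "v2"), ("T1", "v1")]
snapshot = frozenset({"T1", "T2"})   # taken at the query's first Read
```

With this snapshot the query reads "v2", and it keeps reading "v2" even
if T3 commits later, because the snapshot is fixed for the lifetime of
the query.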

Another  undo/no-redo  algorithm  is  described  in  [Ba]. 


4. Algorithms That Redo But Don't Undo

In  the  generic  algorithm,  each  command  is  processed  as  follows. 

Readi(Pj). If Ti previously wrote Pj, then copy the after-image of
Pj into a buffer. Otherwise, copy Pj from the stable database into a
buffer. {Ack}

Writei(Pj). Write the new value of Pj into the audit trail. {Ack}

Commiti. Write Ti into the commit list. Then for each page written
by Ti, copy the after-image from the audit trail into the stable
database. {Ack} Then delete Ti from the active list.

Aborti. Write Ti into the abort list. {Ack} Then delete it from
the active list.

Restart. For each Ti that is on the active list but not on the
commit list, process Aborti. {Ack} For each Tj on the active list and
the commit list, process Commitj.

In this algorithm, pages written by a transaction are not written
into the stable database until after the transaction commits. Thus,
undo is never needed, but a Restart may require redo.
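
A minimal sketch of the generic redo/no-undo algorithm follows. The
class name RedoNoUndo and the crash_before_install flag (used only to
simulate a failure right after the commit list write) are assumptions
of this illustration; dictionaries stand in for stable storage and
acknowledgements are omitted.

```python
class RedoNoUndo:
    """Sketch of the generic redo/no-undo algorithm."""

    def __init__(self, stable_db):
        self.db = dict(stable_db)   # stable database: page -> contents
        self.audit = {}             # audit trail: txn -> {page: after_image}
        self.active = set()
        self.commit_list = set()
        self.abort_list = set()

    def read(self, txn, page):
        own = self.audit.get(txn, {})
        return own[page] if page in own else self.db.get(page)

    def write(self, txn, page, value):
        self.active.add(txn)
        self.audit.setdefault(txn, {})[page] = value   # commit rule

    def commit(self, txn, crash_before_install=False):
        self.commit_list.add(txn)   # the atomic commit instruction
        if crash_before_install:
            return                  # simulated system failure
        self.db.update(self.audit.get(txn, {}))        # install (or redo)
        self.active.discard(txn)

    def abort(self, txn):
        self.abort_list.add(txn)    # nothing to undo in the database
        self.active.discard(txn)

    def restart(self):
        for txn in list(self.active):
            if txn in self.commit_list:
                self.commit(txn)    # redo the installation
            else:
                self.abort(txn)
```

If the system fails after Ti is on the commit list but before its
after-images reach the stable database, restart finds Ti on both the
active list and the commit list and redoes the installation from the
audit trail.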

This algorithm obeys the commit rule, because the after-images of
pages written by Ti are stored on the audit trail before Ti commits. It
also obeys the log ahead rule, since no after-image of a transaction is
written into the stable database before it commits.

Implementations of this algorithm are described in [LS, ML]. This
type of recovery algorithm is used in the Ingres Database System [St]
and in SDD-1 [BSR].


5. Algorithms That Redo And Undo

In this algorithm, commands are processed as follows.

Readi(Pj). If Ti previously wrote Pj, then copy the after-image of
Pj into a buffer. Otherwise, copy Pj from the stable database into a
buffer. {Ack}

Writei(Pj). Copy the before-image and the after-image of Pj into
the audit trail. {Ack} Then, sometime later, write the after-image into
the stable database.

Commiti. Write Ti into the commit list. Then, for each page
written by Ti, write the after-image into the stable database (if it
hasn't already been done). {Ack} Then, delete Ti from the active list.

Aborti. Write Ti into the abort list. Then, for each page written
by Ti, if its after-image has already been written into the stable
database, write its before-image into the stable database. {Ack} Then
delete Ti from the active list.

Restart. For each Ti on the active list and the commit list,
process Commiti. For each Ti on the active list but not on the commit
list, process Aborti. {Ack}

Note that Abort may require undo and Restart may require redo.
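
The generic undo/redo algorithm can be sketched as below. This is an
illustration under assumptions, not a definitive implementation: the
class name UndoRedo, the flush flag (which models "sometime later"
happening before commit) and the crash_after_commit_point flag are
hypothetical, and "the after-image has already been written" is
approximated by comparing page contents.

```python
class UndoRedo:
    """Sketch of the generic undo/redo algorithm."""

    def __init__(self, stable_db):
        self.db = dict(stable_db)   # stable database: page -> contents
        self.audit = []             # (txn, page, before_image, after_image)
        self.active = set()
        self.commit_list = set()
        self.abort_list = set()

    def write(self, txn, page, value, flush=False):
        self.active.add(txn)
        # Both images reach the audit trail before any database write.
        self.audit.append((txn, page, self.db.get(page), value))
        if flush:
            self.db[page] = value   # early write, may need undo later

    def commit(self, txn, crash_after_commit_point=False):
        self.commit_list.add(txn)   # the atomic commit instruction
        if crash_after_commit_point:
            return                  # simulated system failure
        for t, page, _, after in self.audit:
            if t == txn:
                self.db[page] = after      # install, or redo on restart
        self.active.discard(txn)

    def abort(self, txn):
        self.abort_list.add(txn)
        for t, page, before, after in reversed(self.audit):
            # Undo only writes that actually reached the stable database.
            if t == txn and self.db.get(page) == after:
                self.db[page] = before
        self.active.discard(txn)

    def restart(self):
        for txn in list(self.active):
            if txn in self.commit_list:
                self.commit(txn)
            else:
                self.abort(txn)
```

In the usage below, T1's early write to P1 is undone and T2's committed
but uninstalled write to P2 is redone by a single restart.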

This algorithm obeys the commit rule, since the after-image of each
page written by Ti is written into the audit trail before Ti commits.
It also obeys the log ahead rule, since the before-image of each page
written by Ti is written into the audit trail before its after-image is
written into the stable database.

One can improve the performance of this algorithm by using a
variation proposed by Gray [Gr]. Gray's algorithm processes commands as
follows.

Readi(Pj). If Ti previously wrote Pj, check to see if the
after-image is in buffer storage. If not, copy Pj from the stable
database to a buffer. {Ack}

Writei(Pj). Copy the before-image of Pj into buffer storage unless
it is already there. Write the after-image of Pj into buffer storage;
this step must not overwrite the before-image. {Ack} Sometime later,
write the before-image into the audit trail, leaving a copy of the
after-image in buffer storage. The after-image may be written into the
stable database any time after the before-image is written into the
audit trail. Once the after-image is written both to the audit trail
and the stable database, it may be removed from buffer storage.

Commiti. After all the after-images of pages written by Ti have
been written into the audit trail, write Ti into the commit list. {Ack}

Aborti and Restart are the same as in the generic algorithm.

This algorithm obeys the log ahead rule because the before-image of
each page is written in the audit trail before the after-image is
written in the stable database. The commit rule is also satisfied since
Ti's after-images are written into the audit trail before Ti commits.

When all after-images written by Ti have been written into the
stable database, Ti can be deleted from the active list. This tells
Restart that Ti does not need to be redone.

The main benefit of this algorithm is that the decision to write
pages into stable storage is usually left to the database system's
buffer management algorithm. The recovery algorithm writes into stable
storage only when the commit or log ahead rule requires it.

A detailed implementation of this algorithm, which incorporates
checkpoints and in which transactions write records instead of entire
pages, appears in [Li].


6.  Algorithms  That  Don't  Undo  Or  Redo 

In the generic algorithm, each command is processed as follows.

Readi(Pj). If Ti previously wrote Pj, then copy the after-image of
Pj into a buffer. Otherwise, copy Pj from the stable database into a
buffer. {Ack}

Writei(Pj). Write the after-image of Pj into the audit trail.
{Ack}

Commiti. In a single instruction, write the after-images of all
pages written by Ti into the stable database and delete Ti from the
active list. {Ack}

Aborti. Write Ti into the abort list. {Ack} Then delete it from
the active list.

Restart. For each Ti on the active list, process Aborti. {Ack}

Unfortunately, this description isn't very informative because it
relies on a magical instruction that implements commit without even
using a commit list. Notice that if the magical instruction is
available, then undo isn't needed because a transaction's after-images
are not written into the stable database before it commits, and redo
isn't needed because a transaction's after-images are written into the
stable database in the instruction that commits the transaction.

We will describe an implementation of the Commit instruction
similar to one presented in [Lo].

Lorie's Shadow Page Algorithm

Assume that the stable database is partitioned into files
{F1, ..., Fn}, each of which is a sequence of logical pages. Each file,
Fj, has a page table, PTj, whose entries point to the pages of Fj. That
is, PTj[k] contains the address of the k-th page of Fj; this page is
denoted Pjk. Assume that each page table fits on one page in the stable
database. The stable database also contains in a fixed address a
record, M, that points to the n page tables; M[j] contains the address
of PTj.

Abort  and  Restart  are  processed  as  in  the  generic  algorithm.  Read, 
Write,  and  Commit  are  processed  as  follows. 

For each file, Fj, the first Read or Write that Ti issues on a page
of Fj causes the recovery algorithm to make a copy of PTj in buffer
storage, denoted PTji. For each page Pjk that Ti writes, PTji[k] will
point to the after-image of that page in the audit trail. (The other
entries in PTji are irrelevant.)

Readi(Pjk). If Ti previously wrote Pjk, then copy the after-image
of Pjk from address PTji[k] into a buffer. Otherwise, use M to find PTj
and copy Pjk from address PTj[k] in the stable database into a buffer.
{Ack}

Writei(Pjk). Write the new copy of Pjk into the audit trail. Then,
assign PTji[k] the address of that audit trail page. {Ack}

Commiti. Copy M into buffer storage. For each file Fj that Ti
wrote into, use (the buffer copy of) M to find PTj and copy it into an
empty page of buffer storage. (There are now two page tables for Fj
connected to Ti: the buffer copy of PTj that was just read and PTji.)
For each page Pjk that was written by Ti, assign to the buffer copy of
PTj[k] the contents of PTji[k]. Then, write PTj into a new location in
scratch space; denote this new copy of PTj by PTj'. Then, for each Fj
that Ti wrote into, assign to (the buffer copy of) M[j] the address of
PTj'. Then write M back to its fixed address in stable storage. {Ack}

The commit algorithm prepares a scratch copy of each page table
(PTj') and then installs it. Installation is accomplished by assigning
to M[j] the address of PTj' for each file Fj that Ti wrote. By writing
M back to the stable database, the old copies of the page tables (PTj)
are replaced by the new ones (PTj').

The instruction that commits Ti is the one that writes the updated
M back into the stable database. Before this write, any read will use
the old copy of M to read the before-image of any page written by Ti.
After this write, it will read the after-image of any such page.
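
The shadow page mechanism can be sketched with a dictionary standing in
for stable storage. This is a simplified illustration under stated
assumptions: the class name ShadowPages, the address allocator and the
use of address 0 for M are choices made here, and the sketch commits one
transaction at a time without modelling failures.

```python
class ShadowPages:
    """Sketch of shadow paging: Commit is one write of the record M."""

    M_ADDR = 0  # fixed stable address of M (an assumption of the sketch)

    def __init__(self, files):
        # files: {file name: [contents of page 0, page 1, ...]}
        self.stable = {}        # address -> page, page table, or M
        self.next_free = 1
        master = {}
        for name, pages in files.items():
            table = [self._store(p) for p in pages]    # PTj
            master[name] = self._store(table)          # M[j]
        self.stable[self.M_ADDR] = master
        self.private = {}   # txn -> {file: {k: address}}, i.e. PTji

    def _store(self, contents):
        addr = self.next_free
        self.next_free += 1
        self.stable[addr] = contents
        return addr

    def read(self, txn, name, k):
        own = self.private.get(txn, {}).get(name, {})
        if k in own:                       # txn's own after-image
            return self.stable[own[k]]
        table = self.stable[self.stable[self.M_ADDR][name]]
        return self.stable[table[k]]

    def write(self, txn, name, k, contents):
        addr = self._store(contents)       # after-image to the audit trail
        self.private.setdefault(txn, {}).setdefault(name, {})[k] = addr

    def commit(self, txn):
        master = dict(self.stable[self.M_ADDR])        # buffer copy of M
        for name, updates in self.private.pop(txn, {}).items():
            table = list(self.stable[master[name]])    # buffer copy of PTj
            for k, addr in updates.items():
                table[k] = addr
            master[name] = self._store(table)          # PTj' in scratch space
        self.stable[self.M_ADDR] = master  # the single committing write

    def abort(self, txn):
        self.private.pop(txn, None)        # stable database never touched
```

Until the final assignment in commit, every other transaction still
reaches the old page tables through M and therefore sees only
before-images; afterwards it sees the after-images.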

The recovery algorithm can only commit one transaction at a time.
That is, Commit is a critical section. If two transactions were
(incorrectly) to commit concurrently, each transaction might read a copy
of PTj into buffer storage, change the pointers to pages it wrote, and
write that copy of PTj to the audit trail. Thus, two copies of PTj'
would exist. Whichever transaction updated M first would lose its
updates to PTj, since they would be overwritten by the second
transaction when it installed its copy of PTj' by updating M.


A version of Lorie's algorithm is implemented in System R's
recovery manager [GMBLL].


7. Recovery In A Distributed Database System

A  distributed  database  system  (DDBS)  consists  of  a  set  of  sites 
connected by a network. Each transaction can read or write data stored
at  any  of  the  sites. 

We model a DDBS by a set of processes called data modules (DMs) and
transaction modules (TMs). A DM is a centralized database system as
defined in Section 2. It processes Reads and Writes on pages stored at
that DM. It also processes Commits and Aborts, which permanently
install or undo the writes of a transaction at that DM.

A TM interfaces transactions and DMs. Each transaction, Ti,
submits commands to one TM, say TMa. To process Readi or Writei, TMa
simply sends the command to the DM that stores the data being read or
written. Let Activei be the DMs at which Ti was active. To process
Aborti, TMa must ensure that every DM in Activei processes Aborti. To
process Commiti, TMa should try to ensure that every DM in Activei
processes Commiti.

Unfortunately, TMs and DMs may fail at unpredictable times. TMa
must process commands so that such failures never cause it to produce
incorrect results.

We assume that process (i.e., TM and DM) failures are "clean". If
a process does not produce an expected response to a message within a
timeout period, then the process has really failed. If one process
believes another process is down, then all processes believe that the
process is down. And, when a process recovers, it recognizes that it
has just recovered from a failure and runs a special "reintegration
protocol". Mechanisms that support these assumptions are beyond the
scope of this paper. (See [ABO, BS, PB, Wa].)

Each TM keeps an active list, commit list, and abort list in stable
storage. And, for each Ti on the active list, it maintains Activei in
stable storage. When it receives a Read or Write from Ti, it sends the
command to the appropriate DM and adds that DM to Activei. For the
first such Read or Write, the TM also adds Ti to its active list. It
processes Abort and Commit as follows.

Aborti. Add Ti to the abort list. Then, send Aborti to each DM in
Activei. Wait for every DM to acknowledge that it processed Aborti.
{Ack} Delete Ti from the active list.

Commiti. Add Ti to the commit list. Then, send Commiti to each DM
in Activei. Wait for every DM to acknowledge that it processed Commiti.
{Ack} Delete Ti from the active list.

If a TM fails and later restarts, then it processes a Restart in
the usual way: For each Ti on both the commit list and the active list,
process Commiti. For all other Ti on the active list, process Aborti.

If a TM, say TMa, discovers that a DM, say DMb, has failed, then it
normally processes Aborti for each Ti that has DMb in Activei. But what
if DMb is in Activei and TMa has already sent Commiti to other DMs in
Activei? In this case it can't abort Ti, because Ti may already be
committed at some DMs. Instead, it must wait for DMb to recover. When it
does, TMa sends Commiti to DMb too.

Two-Phase Commit

Each TM must obey the commit rule. That is, it must not send
Commiti to any DM until every DM in Activei has Ti's after-images on
stable storage. Otherwise, a DM in Activei may:

1. Fail before receiving Commiti.

2. Upon recovering, discover from TMa that Ti has committed.

3. But be unable to process Commiti because it lost some of Ti's
after-images due to the failure.

To obey the commit rule and thereby prevent (3), TMa can use the
two-phase commit protocol for processing Commit commands [LS]. Phase
one begins when TMa receives Commiti. It then sends a command called
Endi to each DM in Activei. A DM processes Endi by first ensuring that
Ti's after-images at that DM are on stable storage and then sending an
acknowledgement to TMa. When TMa has received the acknowledgement from
every DM in Activei, phase one is done. {Ack} In phase two, TMa sends
Commiti to each DM.

Since TMa does not send Commiti to any DM until every DM has
acknowledged Endi, no DM in Activei will process Commiti until every DM
has Ti's after-images on stable storage.

If a DM, say DMb, fails before acknowledging Endi, then TMa won't
leave phase one. Since TMa cannot be sure that DMb will be able to
process Commiti when it recovers, TMa must either wait for DMb to
recover or abort Ti by sending Aborti to every DM in Activei. In
practice, TMa simply waits a prespecified timeout period after
distributing the Endi's; if it hasn't received an acknowledgement of
some Endi by this time, it assumes the DM has failed and aborts Ti.

Until a DM processes Endi, it may unilaterally decide to abort Ti
by sending an Aborti command to TMa. Once a DM acknowledges Endi, it
loses its right to unilaterally abort Ti, and may only abort Ti if
directed to do so by TMa.
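
The two phases can be sketched from the TM's point of view. This is an
illustrative sketch only: the class DataModule, the function
two_phase_commit and the state names are assumptions, messages are
modelled as method calls, and a DM that is down is modelled by an End
that returns False, standing in for a timeout.

```python
class DataModule:
    """A stand-in for a DM; 'up' is False while the DM is failed."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.state = {}   # txn -> "ended", "committed" or "aborted"

    def end(self, txn):
        if not self.up:
            return False          # no acknowledgement before the timeout
        # Here the DM would force txn's after-images to stable storage.
        self.state[txn] = "ended"
        return True

    def commit(self, txn):
        self.state[txn] = "committed"

    def abort(self, txn):
        if self.up:
            self.state[txn] = "aborted"


def two_phase_commit(txn, active_dms):
    """Process Commit for txn at the TM; returns the outcome."""
    # Phase one: send End to every DM in Active and collect the acks.
    acks = [dm.end(txn) for dm in active_dms]
    if not all(acks):
        for dm in active_dms:     # some DM timed out: abort everywhere
            dm.abort(txn)
        return "aborted"
    # Phase two: every DM has the after-images on stable storage.
    for dm in active_dms:
        dm.commit(txn)
    return "committed"
```

No DM receives Commit unless every DM acknowledged End, which is how the
protocol enforces the commit rule across sites.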


Three-Phase Commit

The TM algorithm presented above has a serious disadvantage.
Suppose TMa sends Endi to DMb, DMb acknowledges Endi, and then TMa
fails. Since DMb doesn't know whether Ti will commit or abort, it has to
wait for TMa to recover. In particular, it must hold Ti's locks until
TMa recovers. If TMa is supervising many active transactions, large
portions of the database may be locked and unavailable until TMa
recovers.

We can avoid this problem by providing each TM with one or more
backup TMs. If a TM fails, the backups can take over its functions.

One such algorithm is three-phase commit [Sk1, 2, 3, SS]. Each
backup for TMa maintains a commit list, CLa. To process Commiti, TMa
behaves as follows.

1. TMa sends Endi to each DM in Activei. Then, it waits for all DMs to
acknowledge their Endi's.

2. TMa sends a command called Precommiti to each backup TM. A TM
processes Precommiti by adding Ti to its copy of CLa, and then
sending an acknowledgement to TMa. TMa waits for all backups to
acknowledge Precommiti.

3. TMa sends Commiti to each DM in Activei.

Essentially, this is the two-phase commit protocol with a new phase
added (step (2)).

If  a  backup  TM  fails,  TMa  can  ignore  the  failure  if  the  number  of 
backups  is  still  acceptably  large;  otherwise,  it  should  acquire  another 
backup  TM  to  replace  the  failed  one. 

Suppose TMa fails. When the backups discover the failure, they
elect one of their member TMs, say TMb, to replace TMa. After TMb is
elected, every other backup TM sends its copy of CLa to TMb. TMb
computes the union of those copies and distributes the result to other
backups. This becomes everyone's copy of CLa. When this process is
complete, TMb tells all TMs that it has taken over TMa's functions.

If  a  DM  wants  to  know  what  happened  to  a  particular  transaction, 
Tj.,  that  was  supervised  by  TMa,  it  asks  TMb.  lf  Tj.  is  in  TMb's  CLa, 
then  TMb  tells  the  DM  to  commit  T^;  otherwise,  it  tells  the  DM  to  short 
Ti.  Thus,  a  transaction  that  was  supervised  by  TMa  is  committed  if  and 
only  if  it  reached  the  second  phase  of  three-phase  oommlt  and  at  least 
one  of  its  precommlta  reached  a  backup  TM  (that  didn't  fail). 
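
The takeover rule is small enough to state as code. The sketch below assumes each surviving backup's copy of CLa is available as a Python set of transaction names; `take_over` and `resolve` are hypothetical helper names, not part of the protocol description.

```python
def take_over(backup_commit_lists):
    """Merge the surviving backups' copies of CLa by set union."""
    merged = set()
    for copy in backup_commit_lists:
        merged |= set(copy)
    return merged


def resolve(txn, merged_commit_list):
    """Answer a DM's query: commit iff the transaction is in the merged CLa."""
    return "commit" if txn in merged_commit_list else "abort"
```

For example, with copies `{"T1"}`, `{"T1", "T2"}` and an empty copy, the merged list contains T1 and T2, so a query about T2 is answered "commit" even though only one backup received its precommit, while a query about any other transaction is answered "abort".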

The algorithm for electing a backup TM to replace TMa is easy, as
long as none of the backups fail or recover from failure during the
election. Assume each TM has a unique identifier. To elect a replacement
for TMa, each backup exchanges its identifier with every other
backup. The TM with the largest identifier wins the election and takes
over.
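
A minimal sketch of this failure-free election follows. It assumes identifiers are comparable values (integers here) and that every backup receives every other backup's identifier; the function names are illustrative only.

```python
def local_winner(known_ids):
    """What a single backup concludes from the identifiers it has seen."""
    return max(known_ids)


def run_election(backup_ids):
    """Failure-free case: every backup sees every identifier."""
    views = {b: set(backup_ids) for b in backup_ids}
    winners = {b: local_winner(view) for b, view in views.items()}
    # With identical views, all backups necessarily agree on the winner.
    assert len(set(winners.values())) == 1
    return next(iter(winners.values()))
```

Agreement here depends entirely on all views being identical, which is exactly the condition that failures during the election can violate.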

If backup TMs fail or recover from failure during the election, the
above algorithm can misbehave. Each of two TMs can conclude that it won
the election. Algorithms to prevent this behavior are discussed in [Ga,

It is possible that TMa and all of its backups fail during a short
time period, too short for replacement backups to be acquired. This
is called a total failure of TMa; no TM can ever take over its function.
TMs must wait until TMa and enough of its backups have recovered that
the correct status of TMa's transactions can be determined. Algorithms
for recovering after total failure are discussed in [Ski].

Many variations on three-phase commit protocols have been proposed
and analyzed. See [ABDG, AD, Co, Ea, HS, La, MPM, TGGL].

Replicated Data

If a DM fails, transactions that need that DM's data must wait for
the DM to recover. To avoid this delay, the DBS can replicate data;
that is, it can store parts of the database at more than one DM. If one
copy is unavailable due to a DM failure, other copies can be used
instead.

Many concurrency control algorithms are known for keeping multiple
copies of each page mutually consistent. However, even if concurrency
control is performed correctly, failures can cause transactions to
malfunction.

For example, suppose P1 has copies P1a and P1b at DMa and DMb
(resp.), and P2 has copies P2c and P2d at DMc and DMd. T1 reads P1 and
writes P2; T2 reads P2 and writes P1. Replicated data is handled by
the "intuitive" algorithm: to read data, read any copy; to write data,
write all available copies. The following execution obeys these rules,
yet it is incorrect.

Read1(P1a) --> DMd-fails --> Write1(P2c)

Read2(P2d) --> DMa-fails --> Write2(P1b)

This execution is incorrect because T1 reads (a copy of) P1 before T2
writes P1, while T2 reads (a copy of) P2 before T1 writes P2. The first
condition means that T1 appears to precede T2, while the second
condition means that T2 appears to precede T1. These conditions cannot both
hold in a serial execution, and so the given execution is incorrect.
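
The argument can be checked mechanically by building a precedence graph over the logical pages and looking for a cycle. The sketch below is illustrative: the interleaving in `execution` (both reads before both writes) is one assumed ordering consistent with the example, and the helper names are hypothetical.

```python
# (transaction, action, logical page), listed in the order the operations occur.
execution = [
    ("T1", "read", "P1"),   # Read1(P1a)
    ("T2", "read", "P2"),   # Read2(P2d)
    ("T1", "write", "P2"),  # Write1(P2c), issued after DMd fails
    ("T2", "write", "P1"),  # Write2(P1b), issued after DMa fails
]


def precedence_edges(ops):
    """Edge (Ti, Tj) when Ti's operation precedes a conflicting one of Tj."""
    edges = set()
    for i, (t1, a1, p1) in enumerate(ops):
        for t2, a2, p2 in ops[i + 1:]:
            if t1 != t2 and p1 == p2 and "write" in (a1, a2):
                edges.add((t1, t2))
    return edges


def has_cycle(edges):
    """True if the directed graph given by edges contains a cycle."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)

    def reachable(start, target, seen):
        for nxt in graph.get(start, ()):
            if nxt == target:
                return True
            if nxt not in seen and reachable(nxt, target, seen | {nxt}):
                return True
        return False

    return any(reachable(node, node, {node}) for node in graph)
```

Because each write reached only the copy that the other transaction did not read, no copy-level conflict was ever observed at a single DM, yet the page-level graph contains both T1 -> T2 and T2 -> T1.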


Algorithms for correctly processing commands on replicated data in
the presence of DM failures appear in [ABDG, AD, Ea, Gi, HS, MPM, Th].

No consensus on the best approach to this problem has yet emerged.

8. Network Partitions

A partition is a communication failure that splits the network of
sites into two or more subnetworks such that each site in one subnetwork
is unable to communicate with any site in any other subnetwork (besides
its own). After the partition, the DDBS must decide how to continue with
transactions that were active at the time of the partition and those that
begin executing after the partition.

The latter transactions are easy to handle. A transaction can be
processed normally provided that it reads and writes pages that reside in
one communicating subnetwork and that none of the pages it writes have
replicated copies outside the subnetwork. Otherwise, the transaction cannot
be processed at all.

Transactions that were active at the time of the partition are more
difficult to handle. If a transaction was (and needs to be) active only
at sites in one communicating subnetwork, then it can continue being
processed normally. If a transaction had not yet processed its End_i at some
site in a subnetwork, then the transaction can be aborted in that
subnetwork. If all sites in a communicating subnetwork have processed End_i, and
one or more have also processed Commit_i (resp. Abort_i), then the
transaction can be committed (resp. aborted) at all sites in that subnetwork.
If all sites in a subnetwork have processed End_i, but none have processed
Commit_i or Abort_i, then the transaction is stuck. This subnetwork cannot


determine  whether  the  transaction  was  committed  or  aborted  in  some  other 
subnetwork  with  which  it  cannot  communicate.  Thus,  it  must  leave  the 
transaction  in  an  active  but  blocked  state,  until  the  partition  is 
repaired. 
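
The case analysis for a transaction that was active at partition time can be written as a small decision function. This is a sketch under stated assumptions: each site in the subnetwork reports the last command it processed for the transaction as one of the strings "active" (End not yet processed), "end", "commit" or "abort", and the flag `active_elsewhere` (a hypothetical parameter) says whether the transaction is or needs to be active outside this subnetwork.

```python
def decide(site_states, active_elsewhere=True):
    """Decide what one subnetwork can do with a transaction after a partition."""
    if not active_elsewhere:
        return "continue"   # confined to this subnetwork: process normally
    if "commit" in site_states:
        return "commit"     # some site already processed Commit
    if "abort" in site_states:
        return "abort"      # some site already processed Abort
    if "active" in site_states:
        return "abort"      # some site has not processed End yet
    return "blocked"        # all processed End, none saw Commit or Abort
```

Only the last branch leaves the transaction holding its locks until the partition is repaired.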

Avoiding this blocking situation has been the subject of much
discussion and research. The probability of this event can possibly be
decreased by careful database design and careful selection of backup TMs
for three-phase commit. However, the situation apparently cannot be
eliminated entirely [Ski].


1. INTRODUCTION

1.1  Overview 

In this paper we develop an operational (i.e. state-based) model for
studying reliability of database systems (dbs). The system is described at
any point in time by a "system state". Reliability-related properties of the
system (e.g. "resiliency") can be expressed as predicates on the system state.
Transaction processing algorithms (hereafter called simply "algorithms") can
be described as state transition functions, mapping the current system state
to the next. Finally, correctness and other reliability properties of
algorithms can be proved formally by examining the system state sequences
that can be generated by the algorithms in question.

The  paper  is  organized  as  follows:  In  the  remainder  of  this  section  we 
motivate  our  model  by  informally  describing  the  problem  of  dbs  reliability. 
The  model  itself  is  introduced  in  Section  2.  In  Section  3  we  use  the  model 
to  define  algorithms  and  prove  their  reliability  properties.  Section  4 
extends  the  model  to  describe  certain  aspects  of  reliability  in  distributed 
dbs.  Section  5  comments  on  the  model  and  suggests  extensions. 

1.2  The  Problem  of  dbs  Reliability 

A database is a set of data items. For each item x there is a
designated address in stable storage (e.g. disk) where a value for x is
stored; x's designated address may change over time. The values stored at
the designated addresses of all data items at time t comprise the
materialized database at t [HR82].

To update x, a transaction provides the dbs with x's new value by
placing it in a (volatile) main memory buffer. Later the dbs may copy the
value of x from the buffer to x's designated address either directly or
after first copying it to another area of stable storage, called the audit
trail. So, in effect, values of a data item may be found in three places:
in a main memory buffer, in the audit trail, or in the item's designated
address.

Until a transaction T executes its "Commit" operation, it is always
possible for the dbs to abort T. This means that all updates of T to the
database must be "undone" (by setting all updated items to the values they
had prior to being changed by T). Also, any transaction that read values
produced by the aborted transaction T must be (recursively) aborted as
well. Once the dbs agrees to process a transaction's commit operation,
however, it cannot subsequently abort that transaction.

The problem of database reliability is essentially a fault tolerance
problem. We want the dbs to correctly process database operations submitted
by transactions, in spite of transaction and system failures. A transaction
failure amounts to that transaction's being aborted. A system failure amounts
to the loss of volatile storage: in such an event, the dbs must reconstruct,
from the contents of stable storage only (i.e. the materialized database and
the audit trail), the state of the database reflecting the execution of
exactly those transactions that were committed by the time the system
failure occurred.

An  algorithm  for  processing  database  operations  is  fault  tolerant  with 
respect  to  transaction  and  system  failures  if  it  can,  at  all  times,  (a)  abort 
an  uncommitted  transaction  without  having  to  (recursively)  abort  a  committed 
one  and  (b)  reconstruct  the  correct  state  of  every  data  item  from  values 
stored in stable storage.

Accordingly, we can distinguish two aspects of dbs reliability. The first
aspect concerns the parallel processing of transactions. We want to allow
only such parallel executions as will not require aborting a committed
transaction in the face of a transaction or system failure. In [H82] we
characterized executions that have this property, called recoverable
executions. In such executions, a transaction reads the value of a data item only
if it was written by a committed transaction.

The  second  aspect  of  dbs  reliability  arises  from  the  general  principle 
that  fault  tolerance  requires  redundancy.  All  known  dbs  reliability  algorithms 
require  that  some  "redundant"  data  be  kept  in  stable  storage.  This  data  is 
redundant  in  that  it  is  only  needed  in  case  of  a  (transaction  or  system) 
failure.  In  essence,  this  redundant  data  forms  what  we  previously  called 
the  audit  trail. 

Bernstein and others [BGH82] present a classification of algorithms in
terms of their "undo/redo characteristics". We say that an algorithm may
require undo if it allows an uncommitted transaction to record updates in the
materialized database; if the uncommitted transaction aborts, its updates must
be undone. An algorithm may require redo if it allows a transaction to commit
before all its updates have been recorded in the materialized database; if the
system fails, the transaction's updates must be redone. We thus have four
"classes" of algorithms:


             Algorithm                   Requirements
                                         UNDO     REDO

Class 1      [G78], [Li79]               maybe    maybe
Class 2      [B75], [D82]                maybe    never
Class 3      [LS76], [ML80]              never    maybe
Class 4      [Lo77], [V78], [HR79]       never    never

In Section 3 we'll cast these four classes of algorithms in terms of our model.


2. THE MODEL


2.1 Preliminaries

In this subsection we summarize some basic definitions and notation.
For motivation and fuller discussion see [H82].

Let $D = \{x, y, z, \ldots\}$ be a set of data items. The symbols $R_i[x]$, $W_i[x]$,
$A_i$, $C_i$, where $i \in \mathbb{N}$ and $x \in D$, are called database operations. $R_i[x]$ is
the read operation (issued by transaction i for data item x). $W_i[x]$ is
the write operation (issued by transaction i for data item x). $A_i$ is
the abort operation of transaction i and $C_i$ is the commit operation of
transaction i.

Two  operations  conflict  iff 

(a)  they  are  read  or  write  operations  accessing  the  same  data  item  and  (at 
least)  one  of  them  is  a  write  operation;  or 

(b)  (at  least)  one  of  them  is  a  commit  or  an  abort  operation. 
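
As a concrete reading of this definition, the following sketch encodes operations as small records and tests clauses (a) and (b) literally as stated. The `Op` type and its field names are hypothetical conveniences, not notation from the paper.

```python
from typing import NamedTuple, Optional


class Op(NamedTuple):
    kind: str                   # "R", "W", "A" or "C"
    txn: int                    # index i of the issuing transaction
    item: Optional[str] = None  # data item for reads and writes, else None


def conflict(a: Op, b: Op) -> bool:
    """True iff the two operations conflict under clauses (a) and (b)."""
    # Clause (b): at least one operation is a commit or an abort.
    if a.kind in ("A", "C") or b.kind in ("A", "C"):
        return True
    # Clause (a): same data item and at least one of the two is a write.
    return a.item == b.item and "W" in (a.kind, b.kind)
```

Two reads of the same item never conflict, while any pairing with a commit or abort does, so commits and aborts are always ordered relative to the other operations in a log.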

A transaction $T_i$ is a partial order $(OP_i, <_i)$, where

(i) $OP_i \subseteq \{R_i[x], W_i[x], A_i, C_i \mid x \in D\}$,

(ii) any two conflicting operations in $OP_i$ are ordered by $<_i$,

(iii) $A_i \in OP_i$ iff $C_i \notin OP_i$, and

(iv) if $A_i \in OP_i$ then for every $a \in OP_i - \{A_i\}$, $a <_i A_i$; and if
$C_i \in OP_i$ then for every $a \in OP_i - \{C_i\}$, $a <_i C_i$.

A complete log L (over transactions $T_i$, $1 \le i \le n$) is a partial order
$L = (OP_L, <_L)$ where

(i) $OP_L = \bigcup_{i=1}^{n} OP_i$,

(ii) $<_L \supseteq <_i$, for $1 \le i \le n$, and

(iii) any two conflicting operations in $OP_L$ are ordered by $<_L$.

A log $L = (OP_L, <_L)$ is a prefix† of some complete log $L' = (OP_{L'}, <_{L'})$.
If $C_i \in OP_L$, transaction $T_i$ is committed in L; if $A_i \in OP_L$, $T_i$ is
aborted in L. $Com(L) = \{T_i \mid C_i \in OP_L\}$. The projection of a log L onto a set
of transactions T, denoted $\pi_T(L)$, is the restriction of L to the domain
$OP_L \cap \bigcup_{T_i \in T} OP_i$.

For mathematical simplicity we expand logs with a (fictitious)
initializing transaction $T_0$ that writes all data items and commits [P79]. That is,
$T_0 = (OP_0, <_0)$, where $OP_0 = \{W_0[x] \mid x \in D\} \cup \{C_0\}$ and
$<_0 = \{(W_0[x], C_0) \mid x \in D\}$. All of $T_0$'s operations precede all other
operations in the log; i.e. if the log is L, for any $a \in OP_0$ and any
$b \in OP_L - OP_0$, $a <_L b$.

The Herbrand semantics of read and write operations in a log L, denoted
$M_L(R_i[x])$ and $M_L(W_i[x])$ respectively, are defined inductively on $<_L$ as
follows:‡

(1) $M_L(R_i[x]) = M_L(W_j[x])$, where $W_j[x] <_L R_i[x]$ and there is no
$W_k[x] \in OP_L$ such that $W_j[x] <_L W_k[x] <_L R_i[x]$.

(2) Let $R_i[y_t]$, $1 \le t \le m$, be all the read operations in $OP_i$ such that
$R_i[y_t] <_i W_i[x]$. Then $M_L(W_i[x]) = g_{ix}(M_L(R_i[y_1]), \ldots, M_L(R_i[y_m]))$.§
In particular, if there is no $R_i[y] \in OP_i$ such that $R_i[y] <_i W_i[x]$
then $M_L(W_i[x]) = g_{ix}()$.
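
To make the inductive definition concrete, here is a small sketch that computes the Herbrand term of each operation. It makes two simplifying assumptions that are not part of the definition: the log is supplied as a list in some total order consistent with $<_L$, and a term $g_{ix}(\ldots)$ is represented as the tuple `("g", i, x, args)`. The function name `herbrand` is hypothetical.

```python
def herbrand(log, data_items):
    """Return the Herbrand term of every operation in a totally ordered log.

    log: list of (kind, txn, item) with kind "R" or "W".
    """
    # The initializing transaction T0 writes every item: M(W0[x]) = g_0x().
    last_write = {x: ("g", 0, x, ()) for x in data_items}
    reads_so_far = {}  # txn -> terms read by that transaction so far
    meaning = []
    for kind, txn, item in log:
        if kind == "R":
            # Part (1): a read returns the term of the latest preceding write.
            term = last_write[item]
            reads_so_far.setdefault(txn, []).append(term)
        else:
            # Part (2): a write applies g_ix to the transaction's earlier reads.
            term = ("g", txn, item, tuple(reads_so_far.get(txn, ())))
            last_write[item] = term
        meaning.append(term)
    return meaning
```

For the log $W_1[x]$, $R_2[x]$, $W_2[y]$ the sketch yields $g_{1x}()$ for both the write and the read of x, and $g_{2y}(g_{1x}())$ for the write of y.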


† $P = (X, <)$ is a prefix of $P' = (X', <')$ if $X \subseteq X'$, $<$ is the restriction of
$<'$ to X (i.e., $a < b$ iff $a, b \in X$ and $a <' b$), and X is closed under
$<'$ (i.e., if $a \in X$ and for some $b \in X'$, $b <' a$ then $b \in X$).

‡ $<_L$ being a discrete (indeed, finite) partial order, we may induce on the
order of its elements. (The "order" of an element of a partial order $<$
is equal to the number of elements that precede it in $<$.)

§ $g_{ix}$ is an uninterpreted function symbol.

Note that, by our assumption concerning the expansion of logs by the
initializing transaction $T_0$, part (1) of the definition is well-defined,
since $W_0[x]$ precedes any $R_i[x]$. Also, by part (2) of the definition,
$M_L(W_0[x]) = g_{0x}()$, for any $x \in D$.

The Herbrand universe of a log L is the set $HU(L) = \{M_L(a) \mid a \in OP_L$
is a read or write operation$\}$. The semantics of a log L is the function
$M[L]: D \to HU(L)$ defined by $M[L](x) = M_L(W_i[x])$, where $W_i[x]$ is the
$<_L$-maximum write operation on x in $OP_L$. A log L is recoverable iff for
every prefix L' of L and for all $a \in OP_{\pi_{Com(L')}(L)}$, where a is a
read or write, $M_L(a) = M_{\pi_{Com(L')}(L)}(a)$. This implies that at no point in
the execution represented by log L has a committed transaction read a value
written by an uncommitted transaction [H82]. The class of recoverable logs
is denoted RC.


2.2  Basic  Definitions 


Let $\mathcal{T}$ be the set of transactions with operations accessing the
database D. Let $\mathcal{L}$ be the set of logs over transactions in $\mathcal{T}$. Let
$HU = \bigcup_{L \in \mathcal{L}} HU(L)$ be the Herbrand universe of logs in $\mathcal{L}$.

A database state, S, is a mapping from D to HU; $S: D \to HU$. A system
state, $\sigma$, is a triple $\sigma = (S_\sigma, SWI_\sigma, L_\sigma)$ where $S_\sigma$ is a database state,
$L_\sigma \in \mathcal{L}$ and the stable write information at state $\sigma$,
$SWI_\sigma \subseteq D \times HU \times \mathcal{T}$.

In terms of the informal discussion in Section 1.2, the intended
interpretation of the system state $\sigma$ is as follows. The database state represents
the materialized database: $S_\sigma(x) = v$ means that the value of x stored in
the materialized database is v. The stable write information, $SWI_\sigma$,
represents the audit trail: $(x, v, T_i) \in SWI_\sigma$ means that transaction $T_i$ wrote
value v into data item x, and that fact is recorded in the audit trail.
Finally, $L_\sigma$ represents the execution that produced the current system state,
$\sigma$: if $W_i[x] \in OP_{L_\sigma}$, the value $M_{L_\sigma}(W_i[x])$ is written to a volatile memory
buffer. When the dbs records this value in the audit trail, the new system
state, say $\sigma'$, will have $(x, M_{L_\sigma}(W_i[x]), T_i) \in SWI_{\sigma'}$. Similarly, when the
value is recorded in the materialized database, the new system state, say
$\sigma''$, will have $S_{\sigma''}(x) = M_{L_\sigma}(W_i[x])$. Hoping that this informal explanation
provides a basis for understanding the basic building blocks of our model,
we revert to its further formal development.

Definition 2.1. The last committed writer of (data item) x in (system
state) $\sigma$, $lcw_\sigma(x)$, is the transaction $T_i$ such that $C_i, W_i[x] \in OP_{L_\sigma}$ and
for all other $T_j$ such that $C_j, W_j[x] \in OP_{L_\sigma}$ we have
$W_j[x] <_{L_\sigma} W_i[x]$. $\Box_{2.1}$

Note that by the assumption concerning the initial transaction $T_0$,
and the fact that write operations on the same data item (being conflicting
operations) must be related in the partial order of a log, $lcw_\sigma(x)$ is always
well-defined.


Definition 2.2. The committed database state of (system state) $\sigma$, $CS_\sigma$, is
the database state defined by: $CS_\sigma(x) = M_{L_\sigma}(W_i[x])$, where
$T_i = lcw_\sigma(x)$. $\Box_{2.2}$

Let $\Sigma$ be the set of all system states. We distinguish a special system
state, $\sigma_0$, called the initial system state, where: $S_{\sigma_0}(x) = g_{0x}()$ for all
$x \in D$, $SWI_{\sigma_0} = \emptyset$ and $L_{\sigma_0} = \varepsilon$.†

Definition 2.3. A system state $\sigma$ is resilient iff for all $x \in D$, either

(R1) $S_\sigma(x) = CS_\sigma(x)$, or

(R2) $(x, CS_\sigma(x), lcw_\sigma(x)) \in SWI_\sigma$. $\Box_{2.3}$

† We use the symbol $\varepsilon$ for the "empty log", containing only $T_0$; i.e.
$OP_\varepsilon = OP_0$ and $<_\varepsilon = <_0$.

Intuitively, the definition of resiliency says that the "correct" value
of every data item in the system state described by $\sigma$ can be found in
stable storage: either in the "materialized database" ($S_\sigma$), or in the "audit
trail" ($SWI_\sigma$).

If (R1) or (R2) holds for a particular x in a system state $\sigma$ we'll
say that $\sigma$ is resilient for x.
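
Definitions 2.1 to 2.3 translate almost directly into executable predicates. The sketch below is a simplified encoding chosen for illustration: the log is a list in an assumed total order, writes carry their value explicitly as `("W", txn, item, value)` instead of a Herbrand term, commits are `("C", txn)`, and the log includes the operations of the initializing transaction $T_0$. All function names are hypothetical.

```python
def lcw(log, x):
    """Last committed write of x: the latest write on x by a committed txn."""
    committed = {op[1] for op in log if op[0] == "C"}
    writer = None
    for op in log:
        if op[0] == "W" and op[2] == x and op[1] in committed:
            writer = op  # later committed writes override earlier ones
    return writer


def committed_state(log, items):
    """CS(x) for every item: the value written by lcw(x)."""
    return {x: lcw(log, x)[3] for x in items}


def is_resilient(S, SWI, log, items):
    """Check (R1) or (R2) for every data item."""
    for x in items:
        w = lcw(log, x)
        r1 = S[x] == w[3]               # value is in the materialized database
        r2 = (x, w[3], w[1]) in SWI     # value is in the audit trail
        if not (r1 or r2):
            return False
    return True
```

For instance, after $T_1$ writes x and commits, a state whose materialized database still holds $T_0$'s value is resilient only if $T_1$'s write has reached the audit trail.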


LEMMA 2.4. $\sigma_0$ is resilient.

Proof. For all $x \in D$, $CS_{\sigma_0}(x) = M_\varepsilon(W_0[x]) = g_{0x}() = S_{\sigma_0}(x)$. Hence (R1)
holds for all $x \in D$ for $\sigma_0$. $\Box_{2.4}$

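2.2 Algorithms and their Properties

An algorithm (for processing data operations) is a set-valued operator
on $\Sigma$, $\mu_A: \Sigma \to 2^\Sigma$, such that if $\sigma' \in \mu_A(\sigma)$ then $L_{\sigma'} \supseteq L_\sigma$.

A history $\alpha$ is a finite sequence of system states; i.e. $\alpha \in \Sigma^*$. A
history $\sigma_1 \sigma_2 \ldots \sigma_n$ is $\mu_A$-compatible iff $\sigma_{i+1} \in \mu_A(\sigma_i)$ for $1 \le i < n$.

Definition 2.5. An algorithm $\mu_A$ is correct iff for any $\mu_A$-compatible
history $\sigma_0 \sigma_1 \ldots \sigma_n$, $\sigma_n$ is resilient.* $\Box_{2.5}$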

Intuitively, an algorithm is correct if it transforms the system state
in such a way that at any state reachable from the initial one, we can
construct the "correct" value of every data item from information in stable
storage.


* Here and throughout the paper, by $\sigma_0$ we mean specifically the initial
system state, whereas by $\sigma_i$, $1 \le i \le n$, we mean arbitrary elements of $\Sigma$.


It goes without saying that there is quite a gap between this description
of an algorithm and its concrete description in terms of a realistic
programming language. The assumption underlying our model is that each
transition from $\sigma$ to each element of $\mu_A(\sigma)$ can be implemented atomically. If
this is not the case, our results are mathematically sound but pragmatically
meaningless. Often such an atomic implementation is far from obvious. For
implementation-related issues of the algorithms to be described later see
the references cited in Section 1.2.


Definition 2.6. An algorithm $\mu_A$ may require redo iff there exists a
$\mu_A$-compatible history $\sigma_0 \sigma_1 \ldots \sigma_n$ s.t.

$\exists T_i \exists x [S_{\sigma_n}(x) = M_{L_{\sigma_n}}(W_i[x]) \wedge C_i \in OP_{L_{\sigma_n}} \wedge T_i \ne lcw_{\sigma_n}(x)]$.


This definition says that starting in the initial system state the
algorithm has produced a system state in which the value recorded in the
"materialized database" for an item x was written by a committed
transaction, but not the last committed writer of x.


Definition 2.7. An algorithm $\mu_A$ may require undo iff there is a
$\mu_A$-compatible history $\sigma_0 \sigma_1 \ldots \sigma_n$ such that

$\exists T_i \exists x [S_{\sigma_n}(x) = M_{L_{\sigma_n}}(W_i[x]) \wedge C_i \notin OP_{L_{\sigma_n}}]$.


This definition says that starting from the initial system state, the
algorithm has produced a system state in which the value of some data item
x recorded in the "materialized database" was produced by an uncommitted
transaction.
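
The conditions inside Definitions 2.6 and 2.7 are predicates on a single reached state, so they can be tested directly once a state is in hand. The sketch below uses an illustrative encoding that is not the paper's: the log is a list in an assumed total order with entries `("W", txn, item, value)` and `("C", txn)`, every write produces a distinct value, and `S` maps items to values. The function names are hypothetical.

```python
def committed(log):
    """Indices of the transactions whose commit operation is in the log."""
    return {op[1] for op in log if op[0] == "C"}


def writers_of(log, x, value):
    """Transactions whose write on x produced the given value."""
    return [op[1] for op in log
            if op[0] == "W" and op[2] == x and op[3] == value]


def last_committed_writer(log, x):
    """The committed transaction whose write on x comes latest in the log."""
    done = committed(log)
    writer = None
    for op in log:
        if op[0] == "W" and op[2] == x and op[1] in done:
            writer = op[1]
    return writer


def exhibits_redo(S, log, x):
    """S(x) was written by a committed txn that is not the last committed writer."""
    done = committed(log)
    last = last_committed_writer(log, x)
    return any(t in done and t != last for t in writers_of(log, x, S[x]))


def exhibits_undo(S, log, x):
    """S(x) was written by a transaction that has not committed."""
    done = committed(log)
    return any(t not in done for t in writers_of(log, x, S[x]))
```

An algorithm "may require redo" (resp. undo) exactly when some state it can reach from the initial state satisfies the first (resp. second) predicate for some item.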


2.3 Well-formed Algorithms

In this subsection we develop the notion of a "well-formed algorithm"
(wfa). Informally, an algorithm is well-formed if it can, at any time,
respond to "correctly" submitted user (transaction) operations. This means,
for instance, that if an algorithm has produced a system state $\sigma$, it must
be able at the next transition to move to a system state $\sigma'$, reflecting that
an update of a non-terminated transaction† was received by the algorithm.
This could be formally stated as follows in the "language" of our model:

$\forall T_i [C_i, A_i \notin OP_{L_\sigma} \Rightarrow \exists \sigma' \in \mu_A(\sigma) [L_{\sigma'} = L_\sigma \circ W_i[x]]]$.‡


Note that our definition of "correct algorithm" made no provision for
this. So, for example, the algorithm, $\mu_{silly}$, defined by:
$\forall \sigma \in \Sigma$, $\mu_{silly}(\sigma) = \{\sigma_0\}$, is correct according to
Definition 2.5 (since the only $\mu_{silly}$-compatible histories starting with
$\sigma_0$ are of the form $\sigma_0 \sigma_0 \ldots \sigma_0$, and $\sigma_0$ is
resilient by Lemma 2.4). So, sensible algorithms must not only be correct
but well-formed as well.


Definition 2.8. An algorithm $\mu_A$ is well-formed iff it satisfies both of
the following properties:

WFA1: For any $\mu_A$-compatible history $\sigma_0 \sigma_1 \ldots \sigma_n$, there is some
$\sigma \in \mu_A(\sigma_n)$ where $L_\sigma = L_{\sigma_n} \circ a_i$, for any $T_i$ s.t.
$C_i, A_i, a_i \notin OP_{L_{\sigma_n}}$, $a_i \in \{R_i[x], W_i[x], A_i\}$.*

WFA2: For any $\mu_A$-compatible history $\alpha = \sigma_0 \sigma_1 \ldots \sigma_n$ there is a history
$\beta = \tau_1 \tau_2 \ldots \tau_m$ such that

(a) $\alpha\beta$ is $\mu_A$-compatible, and

(b) $L_{\tau_m} = L_{\sigma_n} \circ C_i$, where $C_i, A_i \notin OP_{L_{\sigma_n}}$.


† That is, a transaction not yet committed or aborted.

‡ The notation "$L_{\sigma'} = L_\sigma \circ a_i$" means that $L_{\sigma'}$ is some (arbitrary) extension of
$L_\sigma$ with $a_i$ tacked at the end; formally, $OP_{L_{\sigma'}} = OP_{L_\sigma} \cup \{a_i\}$ and there is
no $b_j \in OP_{L_{\sigma'}}$ s.t. $a_i <_{L_{\sigma'}} b_j$.

* Recall that, by definition, there can be at most one read and one write
operation for a given data item in any given transaction. This restriction is only
made for the sake of keeping the notation simpler: the results in this paper
in no way depend on this assumption.

Informally, WFA1 says that a wfa can at any time receive and immediately
start processing any read, write or abort operation for a non-terminated
transaction. WFA2 is a little different because a transaction cannot
necessarily be committed immediately after a commit request is made. The
algorithm for processing transaction operations may have some
"housecleaning" to do before it can commit a transaction (e.g., transfer some
values from "volatile buffers" to the "materialized database" or the
"audit trail"). However, we do require that a wfa be able to commit a
(non-terminated) transaction with finite delay from any system state it has
reached (starting from $\sigma_0$).

Definition 2.9. The set of histories associated with a log L, relative to
algorithm $\mu_A$, $h_A(L)$, is defined as:

$h_A(L) = \{\alpha \mid \alpha = \sigma_0 \sigma_1 \ldots \sigma_n$ is $\mu_A$-compatible and $L_{\sigma_n} = L\}$. $\Box_{2.9}$

Informally, $h_A(L)$ is the set of histories that could be generated by
algorithm $\mu_A$ in response to the execution represented by L.


2.4  Properties  of  wfa's 

The next theorem shows that wfa's never "get stuck" in processing
operations; i.e., if any (legal) operation is submitted at any time, a wfa
can transform the current system state to one that reflects the fact that it
received and started processing the operation. Recall from the preceding
section that this was the motivation for introducing wfa's.

THEOREM 2.10. If $\mu_A$ is well-formed then for any log L, $h_A(L) \ne \emptyset$.

Proof. By induction on L.

Basis. $L = \varepsilon$ (the empty log). Since $\sigma_0$ is $\mu_A$-compatible (in fact for any
$\mu_A$, not just well-formed $\mu_A$!) and $L_{\sigma_0} = \varepsilon$ by definition of $\sigma_0$, we have
that $\sigma_0 \in h_A(\varepsilon)$. Hence $h_A(L) \ne \emptyset$.

Induction Step. Suppose this is true for a log L'. We'll show that it holds
for $L = L' \circ a_i$, where $a_i \in \{R_i[x], W_i[x], C_i, A_i\}$.

For $a_i = R_i[x]$ ($W_i[x]$, $A_i$, respectively) consider (by induction
hypothesis) any $\sigma_0 \sigma_1 \ldots \sigma_n \in h_A(L')$ and let $\sigma_{n+1} \in \mu_A(\sigma_n)$ such that
$L_{\sigma_{n+1}} = L_{\sigma_n} \circ R_i[x]$ ($L_{\sigma_n} \circ W_i[x]$, $L_{\sigma_n} \circ A_i$, respectively).
$\sigma_{n+1}$ exists by WFA1. Note that $L_{\sigma_n} = L'$ and hence $L_{\sigma_{n+1}} = L' \circ R_i[x]$
($L' \circ W_i[x]$, $L' \circ A_i$, respectively). So, $L_{\sigma_{n+1}} = L$. Thus,
$\sigma_0 \sigma_1 \ldots \sigma_{n+1} \in h_A(L)$.

For the remaining case, $a_i = C_i$, consider any $\alpha = \sigma_0 \sigma_1 \ldots \sigma_n \in h_A(L')$
and let $\beta = \tau_1 \tau_2 \ldots \tau_m$ be such that $\alpha\beta$ is $\mu_A$-compatible and
$L_{\tau_m} = L_{\sigma_n} \circ C_i$. $\beta$ exists by WFA2. Note that $L' = L_{\sigma_n}$ (by inductive
hypothesis) and hence $C_i, A_i \notin OP_{L_{\sigma_n}}$ (otherwise, $L = L' \circ C_i$ would not be a
log). Since $L_{\tau_m} = L_{\sigma_n} \circ C_i$ and $L_{\sigma_n} = L'$ we have $L_{\tau_m} = L' \circ C_i = L$.
Therefore $\alpha\beta \in h_A(L)$ and hence $h_A(L) \ne \emptyset$. $\Box_{2.10}$

LEMMA 2.11. Let $\mu_A$ be a wfa and L be any log. Then if
$\sigma_0 \sigma_1 \ldots \sigma_n \in h_A(L)$ and $\sigma_0 \tau_1 \ldots \tau_m \in h_A(L)$, $CS_{\sigma_n} = CS_{\tau_m}$.

Proof. Immediate from the fact that $L_{\sigma_n} = L_{\tau_m} = L$.

LEMMA 2.12. If $L_\sigma \in RC$ (i.e., $L_\sigma$ is a recoverable log) then
$CS_\sigma = M[\pi_{Com(L_\sigma)}(L_\sigma)]$.

Proof. By definition, for any $x \in D$

$CS_\sigma(x) = M_{L_\sigma}(W_i[x])$, where $T_i = lcw_\sigma(x)$.   (2.12.1)

That is, $W_i[x], C_i \in OP_{L_\sigma}$ and for any $j \ne i$ such that $C_j, W_j[x] \in OP_{L_\sigma}$,
$W_j[x] <_{L_\sigma} W_i[x]$. By definition of log projection, it is easy to see that
$W_i[x], C_i \in OP_{\pi_{Com(L_\sigma)}(L_\sigma)}$, and that if $W_j[x], C_j \in OP_{\pi_{Com(L_\sigma)}(L_\sigma)}$ for some
$j \ne i$, then $W_j[x] <_{\pi_{Com(L_\sigma)}(L_\sigma)} W_i[x]$. Hence,

$M[\pi_{Com(L_\sigma)}(L_\sigma)](x) = M_{\pi_{Com(L_\sigma)}(L_\sigma)}(W_i[x])$.   (2.12.2)

By definition of recoverability, we have

$M_{\pi_{Com(L_\sigma)}(L_\sigma)}(W_i[x]) = M_{L_\sigma}(W_i[x])$.   (2.12.3)

From (2.12.1), (2.12.2), (2.12.3) we get

$CS_\sigma(x) = M[\pi_{Com(L_\sigma)}(L_\sigma)](x)$ for any $x \in D$. $\Box_{2.12}$


The  next  theorem  relates  wfa's  to  recoverable  executions. 


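THEOREM 2.13. If $\mu_A$ is a wfa, $L \in RC$, $\sigma_0 \sigma_1 \ldots \sigma_n \in h_A(L)$,
$\sigma_0 \tau_1 \ldots \tau_m \in h_A(\pi_{Com(L)}(L))$ then $CS_{\sigma_n} = CS_{\tau_m}$.

Proof. By Lemma 2.12, $CS_{\sigma_n} = M[\pi_{Com(L_{\sigma_n})}(L_{\sigma_n})] = M[\pi_{Com(L)}(L)]$, and
$CS_{\tau_m} = M[\pi_{Com(L_{\tau_m})}(L_{\tau_m})] = M[\pi_{Com(\pi_{Com(L)}(L))}(\pi_{Com(L)}(L))] = M[\pi_{Com(L)}(L)]$.
Hence, $CS_{\sigma_n} = CS_{\tau_m}$. $\Box_{2.13}$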


n  m 


Informally this theorem says that the committed database state of any
system state reached by a wfa in response to the operations of a recoverable
execution E is the same as the committed database state of a system state
reached by the same wfa in response to execution E', which is the same as
E except that the operations of aborted or uncommitted transactions of E
never happened. Note that the committed database state at any time t is
what we want to reconstruct as the "correct" database state, if a system
failure happens at t.

Now,  if  the  algorithm,  in  addition  to  being  well-formed,  is  also 
correct,  we  have  enough  information  in  "stable  storage"  to  reconstruct  that 
database  state. 

In the following section we shall study four algorithms (the four
classes mentioned in Section 1.2) and shall prove that they are correct
and well-formed. From Theorem 2.13 and these anticipated facts we conclude
that any of these four algorithms, coupled with a scheduler that produces
recoverable logs, provides tolerance to the faults under consideration,
that is, transaction and system failures.


3.  THE  FOUR  BASIC  ALGORITHMS 


3. 1  Introduction 


In this section we examine in detail the four algorithms introduced
in Section 1.2. First we define these algorithms in the style of the
previous section; that is, as (non-deterministic) state transition functions.
We define four such functions μ_I, μ_II, μ_III, μ_IV corresponding,
respectively, to the four algorithms. The functions are defined by listing
"types" of transitions as in [Ly82]. Each transition type is characterized
by (a) a set of conditions on states and (b) state transformation rules,
describing how the new state is to be obtained from the old. The idea is
that if a state σ satisfies the conditions of a transition type of
algorithm μ_A, and σ' is obtained from σ as described by the state
transformation rules of that transition type, then σ' ∈ μ_A(σ).
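The idea of a non-deterministic state transition function built from transition types can be sketched in Python. This is only an illustration under stated assumptions: the names `State`, `TransitionType`, `step`, `is_compatible` and `abort_successors` are hypothetical and do not appear in the text, operations are encoded as tuples, and the single "abort" type shown is restricted to a fixed transaction set {1, 2}.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class State:
    """A system state: materialized database S, audit trail SWI, log L."""
    S: tuple = ()                  # (item, value) pairs
    SWI: frozenset = frozenset()   # (item, value, transaction) triples
    L: tuple = ()                  # operations, e.g. ("w", 1, "x") or ("a", 1)


@dataclass(frozen=True)
class TransitionType:
    name: str
    # Yields every successor allowed by this type; yields nothing when the
    # type's conditions do not hold in the given state.
    successors: Callable[[State], Iterable[State]]


def step(types: List[TransitionType], sigma: State) -> Set[State]:
    """The set mu(sigma): all states reachable from sigma in one transition."""
    return {nxt for t in types for nxt in t.successors(sigma)}


def is_compatible(types: List[TransitionType], history: List[State]) -> bool:
    """True if each state of the history is a successor of the previous one."""
    return all(b in step(types, a) for a, b in zip(history, history[1:]))


def abort_successors(sigma: State) -> Iterable[State]:
    """Hypothetical 'abort' type over the fixed transaction set {1, 2}."""
    ended = {op[1] for op in sigma.L if op[0] in ("c", "a")}
    for i in (1, 2):
        if i not in ended:  # condition: neither C_i nor A_i is in the log
            # transformation rule: S and SWI unchanged, A_i appended to L
            yield State(sigma.S, sigma.SWI, sigma.L + (("a", i),))
```

Because `step` returns a set of successors rather than a single state, the function is non-deterministic in the sense used above: any member of the set is an admissible next state.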

For every one of these four algorithms (state transition functions)
we prove rigorously its correctness, well-formedness and properties with

Transformation rules:

S_{σ'} = S_σ
SWI_{σ'} = SWI_σ
L_{σ'} = L_σ ∘ R_i[x]


I.2: [Submit a write operation]

Conditions:

∃T_i ∃x [C_i, A_i, W_i[x] ∉ OP_{L_σ}]

Transformation rules:

S_{σ'} = S_σ
SWI_{σ'} = SWI_σ
L_{σ'} = L_σ ∘ W_i[x]


I.3: [Record an update in the "audit trail"]

Conditions:

∃T_i ∃x [C_i, A_i ∉ OP_{L_σ} ∧ W_i[x] ∈ OP_{L_σ}]

Transformation rules:

S_{σ'} = S_σ
SWI_{σ'} = SWI_σ ∪ {(x, M_{L_σ}(W_i[x]), T_i)}
L_{σ'} = L_σ


I.4: [Record an update in the "materialized database"]

Conditions:

∃T_i ∃x [(x, M_{L_σ}(W_i[x]), T_i) ∈ SWI_σ]

Transformation rules:

S_{σ'}(y) = S_σ(y)            if y ≠ x
S_{σ'}(y) = M_{L_σ}(W_i[x])   if y = x
SWI_{σ'} = SWI_σ
L_{σ'} = L_σ

¹We adopt the convention that the existentially quantified objects in the
conditions bind any free objects in the transformation rules. Hence the
subscript i and the x in R_i[x] here refer to the transaction T_i
and the data item x described by the conditions.


I.5: [Commit a transaction]

Conditions:

∃T_i [C_i, A_i ∉ OP_{L_σ} ∧ ∀W_i[x] ∈ OP_{L_σ} [(x, M_{L_σ}(W_i[x]), T_i) ∈ SWI_σ]]

Transformation rules:

S_{σ'} = S_σ
SWI_{σ'} = SWI_σ
L_{σ'} = L_σ ∘ C_i


I.6: [Abort a transaction]

Conditions:

∃T_i [C_i, A_i ∉ OP_{L_σ}]

Transformation rules:

S_{σ'} = S_σ
SWI_{σ'} = SWI_σ
L_{σ'} = L_σ ∘ A_i


I.7: [Restore update of aborted transaction in "materialized database"]

Conditions:

∃T_i ∃x [A_i ∈ OP_{L_σ} ∧ S_σ(x) = M_{L_σ}(W_i[x])]

Transformation rules:

S_{σ'}(y) = S_σ(y)    if y ≠ x
S_{σ'}(y) = CS_σ(x)   if y = x
SWI_{σ'} = SWI_σ


I.8: [Discard unnecessary information from the "audit trail"]

Conditions:

∃T_i ∃x [(C_i ∈ OP_{L_σ} ∧ T_i ≠ lcw_σ(x)) ∨ A_i ∈ OP_{L_σ}]

Transformation rules:

SWI_{σ'} = SWI_σ − {(x, M_{L_σ}(W_i[x]), T_i)}
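Transition types I.3, I.4 and I.5 can be sketched as small Python functions over a triple (S, SWI, L). This is a sketch under assumptions, not the report's own formulation: the function names are hypothetical, a write operation in the log is encoded as `("w", i, x, value)` so that M_L(W_i[x]) can be read off the log directly, and the initial values are arbitrary. The replay at the end follows the order I.2, I.2, I.3, I.3, I.4, I.5.

```python
def value_written(log, i, x):
    """M_L(W_i[x]): the value T_i wrote to x according to log L, else None."""
    for op in log:
        if op[0] == "w" and op[1] == i and op[2] == x:
            return op[3]
    return None


def ended(log, i):
    """True if C_i or A_i already appears in the log."""
    return any(op[0] in ("c", "a") and op[1] == i for op in log)


def record_in_audit_trail(S, SWI, L, i, x):
    """Type I.3: copy T_i's update of x into the audit trail SWI."""
    v = value_written(L, i, x)
    if v is None or ended(L, i):
        raise ValueError("I.3 conditions do not hold")
    return S, SWI | {(x, v, i)}, L


def record_in_database(S, SWI, L, i, x):
    """Type I.4: install an update already present in SWI into S."""
    v = value_written(L, i, x)
    if (x, v, i) not in SWI:
        raise ValueError("I.4 conditions do not hold")
    return {**S, x: v}, SWI, L


def commit(S, SWI, L, i):
    """Type I.5: commit T_i once all of its updates are in SWI."""
    written = [op[2] for op in L if op[0] == "w" and op[1] == i]
    if ended(L, i) or any(
        (x, value_written(L, i, x), i) not in SWI for x in written
    ):
        raise ValueError("I.5 conditions do not hold")
    return S, SWI, L + [("c", i)]


# Replay: T1 writes x and T2 writes y (two I.2 steps, already in the log),
# both updates go to SWI, only x is installed in S, then only T2 commits.
S, SWI = {"x": 0, "y": 0}, frozenset()
L = [("w", 1, "x", 10), ("w", 2, "y", 20)]
S, SWI, L = record_in_audit_trail(S, SWI, L, 1, "x")
S, SWI, L = record_in_audit_trail(S, SWI, L, 2, "y")
S, SWI, L = record_in_database(S, SWI, L, 1, "x")
S, SWI, L = commit(S, SWI, L, 2)
```

At the end of the replay, S holds the value of the uncommitted T1 for x, while the value of the committed T2 for y is present only in SWI.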


THEOREM 3.2. Algorithm μ_I is correct.

Proof. For any μ_I-compatible history σ_0σ_1...σ_n, we'll show, by
induction, that σ_k is resilient, 0 ≤ k ≤ n.

Basis: σ_0 is resilient by Lemma 2.4.

Induction Step: Suppose σ_k is resilient for some k, 0 ≤ k < n. By
μ_I-compatibility of σ_0σ_1...σ_n we have that σ_{k+1} ∈ μ_I(σ_k). Consider each
transition type.

Types I.1-3 and 6: In all these cases, lcw_{σ_{k+1}} = lcw_{σ_k}, CS_{σ_{k+1}} = CS_{σ_k},
S_{σ_{k+1}} = S_{σ_k} and SWI_{σ_{k+1}} ⊇ SWI_{σ_k}. Thus if (R1) or (R2) holds for any
x ∈ D in σ_k it will hold for x in σ_{k+1}. By inductive hypothesis then
σ_{k+1} is resilient.


Type I.4: Let T_i, x be such that S_{σ_{k+1}}(x) = M_{L_{σ_k}}(W_i[x]) ≠ S_{σ_k}(x). Since
lcw_{σ_{k+1}} = lcw_{σ_k}, CS_{σ_{k+1}} = CS_{σ_k}, S_{σ_{k+1}}(y) = S_{σ_k}(y) for all y ∈ D − {x} and
SWI_{σ_{k+1}} = SWI_{σ_k}, if (R1) or (R2) holds for y in σ_k it also holds for y
in σ_{k+1}. By inductive hypothesis then, σ_{k+1} is resilient for all
y ∈ D − {x}. Also, if T_i = lcw_{σ_{k+1}}(x) we would have that S_{σ_{k+1}}(x) = CS_{σ_{k+1}}(x),
i.e., (R1) would hold for x in σ_{k+1}. The only remaining case is when
T_i ≠ lcw_{σ_{k+1}}(x). Let T_ℓ = lcw_{σ_{k+1}}(x), and let p be such that L_{σ_p} = L_{σ_{p−1}} ∘ C_ℓ.
That is, T_ℓ committed in the type I.5 transition from σ_{p−1} to σ_p.
It is easy to see that

T_ℓ = lcw_{σ_r}(x)   for p ≤ r ≤ k+1.   (3.2.1)

By the I.5 conditions, we have that (x, M_{L_{σ_{p−1}}}(W_ℓ[x]), T_ℓ) ∈ SWI_{σ_{p−1}}. Clearly,
M_{L_{σ_r}}(W_ℓ[x]) = M_{L_{σ_{p−1}}}(W_ℓ[x]) for p ≤ r ≤ k+1 since L_{σ_{p−1}} is a prefix of
all L_{σ_r}. We claim that

(x, M_{L_{σ_r}}(W_ℓ[x]), T_ℓ) ∈ SWI_{σ_r}   for p−1 ≤ r ≤ k+1.   (3.2.2)

For, suppose not. Then for some q, p < q ≤ k, (x, M_{L_{σ_{q−1}}}(W_ℓ[x]), T_ℓ) ∈ SWI_{σ_{q−1}} −
SWI_{σ_q}. By inspection of μ_I, the transition from σ_{q−1} to σ_q can only
be of type I.8 (because that is the only type in which elements of SWI
are deleted). Since A_ℓ ∉ OP_{L_{σ_{q−1}}} we must have, by the I.8 conditions, that
T_ℓ ≠ lcw_{σ_{q−1}}(x) but this contradicts (3.2.1), as p ≤ q−1 < k. Therefore,
(3.2.2) is true and in particular, since T_ℓ = lcw_{σ_{k+1}}(x) and thus
CS_{σ_{k+1}}(x) = M_{L_{σ_{k+1}}}(W_ℓ[x]), we get that (x, CS_{σ_{k+1}}(x), lcw_{σ_{k+1}}(x)) ∈ SWI_{σ_{k+1}}.
Hence, (R2) holds for x in σ_{k+1}, as needed.

Type I.5: Let T_i be the transaction that commits in the transition from
σ_k to σ_{k+1}, i.e. L_{σ_{k+1}} = L_{σ_k} ∘ C_i. Consider any x ∈ D. If T_i ≠ lcw_{σ_{k+1}}(x)
then lcw_{σ_{k+1}}(x) = lcw_{σ_k}(x), CS_{σ_{k+1}}(x) = CS_{σ_k}(x). Moreover, S_{σ_{k+1}} = S_{σ_k}
and SWI_{σ_{k+1}} = SWI_{σ_k}, and thus if (R1) or (R2) holds for x in σ_k it
also does in σ_{k+1}. By induction hypothesis, then, σ_{k+1} is resilient
for x. If, on the other hand, T_i = lcw_{σ_{k+1}}(x), certainly W_i[x] ∈ OP_{L_{σ_k}}
and by the I.5 conditions, (x, M_{L_{σ_k}}(W_i[x]), T_i) ∈ SWI_{σ_k}. Since SWI_{σ_{k+1}} =
SWI_{σ_k} and CS_{σ_{k+1}}(x) = M_{L_{σ_{k+1}}}(W_i[x]) = M_{L_{σ_k}}(W_i[x]), we have (x, CS_{σ_{k+1}}(x),
lcw_{σ_{k+1}}(x)) ∈ SWI_{σ_{k+1}}, proving that (R2) holds for x in σ_{k+1}, as needed.

k+1  k+1  k+1 

Type I.7: Let T_i, x be such that A_i ∈ OP_{L_{σ_k}} and S_{σ_{k+1}}(x) = CS_{σ_k}(x). For
any y ∈ D − {x}, if (R1) or (R2) holds for y in σ_k it also does in σ_{k+1}.
By inductive hypothesis, then, σ_{k+1} is resilient for y ∈ D − {x}. Since
CS_{σ_{k+1}} = CS_{σ_k} we also have that S_{σ_{k+1}}(x) = CS_{σ_{k+1}}(x) and thus (R1) holds
for x in σ_{k+1}. So σ_{k+1} is resilient.

Type I.8: Let T_i, x be such that (x, M_{L_{σ_k}}(W_i[x]), T_i) ∈ SWI_{σ_k} − SWI_{σ_{k+1}}.
Since lcw_{σ_{k+1}} = lcw_{σ_k}, CS_{σ_{k+1}} = CS_{σ_k} and S_{σ_{k+1}} = S_{σ_k}, if (R1) holds for any
y ∈ D in σ_k it also holds in σ_{k+1}. Also, if (R2) holds for y ∈ D − {x}
in σ_k it holds in σ_{k+1} as well. So, let's examine the remaining case,
namely when (R2) holds for x in σ_k. We then have (x, CS_{σ_k}(x),
lcw_{σ_k}(x)) ∈ SWI_{σ_k}. But (x, CS_{σ_k}(x), lcw_{σ_k}(x)) ∈ SWI_{σ_k} − SWI_{σ_{k+1}} only if
lcw_{σ_k}(x) = T_i. By the I.8 conditions, however, either A_i ∈ OP_{L_{σ_k}}, in which
case T_i ≠ lcw_{σ_k}(x), or C_i ∈ OP_{L_{σ_k}} and T_i ≠ lcw_{σ_k}(x), again. Since we
also have that CS_{σ_{k+1}} = CS_{σ_k} and lcw_{σ_{k+1}} = lcw_{σ_k}, we get
(x, CS_{σ_{k+1}}(x), lcw_{σ_{k+1}}(x)) ∈ SWI_{σ_{k+1}} establishing that (R2) holds for x in
σ_{k+1}.

This concludes the induction step and thereby the proof.  □ 3.2
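As the proof uses them, the two resilience conditions are (R1) S_σ(x) = CS_σ(x) and (R2) (x, CS_σ(x), lcw_σ(x)) ∈ SWI_σ. The following Python sketch checks them and shows why a resilient state suffices to rebuild the committed database state. The function names are hypothetical, and two assumptions are made that the text does not state: the last committed writer of each item is known at recovery time, and SWI holds at most one value per (item, transaction) pair.

```python
def is_resilient(S, SWI, CS, lcw):
    """Check that (R1) or (R2) holds for every data item x."""
    return all(
        S[x] == CS[x]                   # (R1): S already holds CS(x)
        or (x, CS[x], lcw[x]) in SWI    # (R2): the audit trail holds CS(x)
        for x in CS
    )


def recover(S, SWI, lcw):
    """Rebuild the committed state from S and SWI, given lcw (assumed known)."""
    restored = {}
    for x in S:
        # Prefer the audit-trail entry of the last committed writer of x;
        # if there is none, (R1) must hold and S[x] is already correct.
        saved = [v for (item, v, t) in SWI if item == x and t == lcw[x]]
        restored[x] = saved[0] if saved else S[x]
    return restored
```

For a resilient state, `recover` returns CS: items covered by (R2) are taken from the audit trail, and the remaining items are covered by (R1).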

THEOREM 3.3. Algorithm μ_I may require both redo and undo.

Proof. Consider the history σ_0σ_1σ_2σ_3σ_4σ_5σ_6 defined by:

(σ_0 is the initial system state)

S_{σ_1} = S_{σ_0},  SWI_{σ_1} = SWI_{σ_0},  L_{σ_1} = L_{σ_0} ∘ W_1[x]

S_{σ_2} = S_{σ_1},  SWI_{σ_2} = SWI_{σ_1},  L_{σ_2} = L_{σ_1} ∘ W_2[y]

S_{σ_3} = S_{σ_2},  SWI_{σ_3} = SWI_{σ_2} ∪ {(x, g_{1x}(), T_1)},  L_{σ_3} = L_{σ_2}

S_{σ_4} = S_{σ_3},  SWI_{σ_4} = SWI_{σ_3} ∪ {(y, g_{2y}(), T_2)},  L_{σ_4} = L_{σ_3}

S_{σ_5}(z) = S_{σ_4}(z)   if z ≠ x
S_{σ_5}(z) = g_{1x}()     if z = x
SWI_{σ_5} = SWI_{σ_4},  L_{σ_5} = L_{σ_4}

S_{σ_6} = S_{σ_5},  SWI_{σ_6} = SWI_{σ_5},  L_{σ_6} = L_{σ_5} ∘ C_2

It can be readily verified that this is a μ_I-compatible history (σ_0 goes
to σ_1 and σ_1 to σ_2 by I.2 transitions, σ_2 to σ_3 and σ_3 to σ_4
by I.3 transitions, σ_4 to σ_5 by an I.4, and σ_5 to σ_6 by an I.5
transition), that S_{σ_6}(y) = g_{0y}() = M_{L_{σ_6}}(W_0[y]), C_0 ∈ OP_{L_{σ_6}} and
T_0 ≠ lcw_{σ_6}(y) = T_2 (showing that μ_I may require redo), and that
S_{σ_6}(x) = g_{1x}() = M_{L_{σ_6}}(W_1[x]) but C_1 ∉ OP_{L_{σ_6}} (showing that μ_I may require
undo).


THEOREM 3.4. Algorithm μ_I is well-formed.

Proof. Satisfaction of WFA1 by μ_I follows immediately from transition
types I.1, 2 and 6. It remains to show that μ_I satisfies WFA2. Let
α = σ_0σ_1...σ_n be any μ_I-compatible history. Let T_i be such that
C_i, A_i ∉ OP_{L_{σ_n}}.

Case 1: If for all W_i[x] ∈ OP_{L_{σ_n}}, (x, M_{L_{σ_n}}(W_i[x]), T_i) ∈ SWI_{σ_n}, the I.5
conditions are satisfied and we may define τ_1 ∈ μ_I(σ_n) such that
L_{τ_1} = L_{σ_n} ∘ C_i. Since σ_0σ_1...σ_nτ_1 is μ_I-compatible, we have that μ_I
satisfies WFA2, in this case.

Case 2: Suppose there is some x such that W_i[x] ∈ OP_{L_{σ_n}} yet
(x, M_{L_{σ_n}}(W_i[x]), T_i) ∉ SWI_{σ_n}. Let x_1, x_2, ..., x_m be all such x's. Because
C_i, A_i ∉ OP_{L_{σ_n}} and W_i[x_k] ∈ OP_{L_{σ_n}}, the conditions for I.3 transitions are
satisfied for 1 ≤ k ≤ m. We may then define τ_k for 0 ≤ k ≤ m as
follows. Let τ_0 = σ_n and

S_{τ_{k+1}} = S_{τ_k}
SWI_{τ_{k+1}} = SWI_{τ_k} ∪ {(x_{k+1}, M_{L_{τ_k}}(W_i[x_{k+1}]), T_i)}
L_{τ_{k+1}} = L_{τ_k}

for 0 ≤ k < m.

Note that τ_{k+1} ∈ μ_I(τ_k), 0 ≤ k < m, by I.3 transitions. Clearly, for all
W_i[x] ∈ OP_{L_{τ_m}}, (x, M_{L_{τ_m}}(W_i[x]), T_i) ∈ SWI_{τ_m}, and since M_{L_{σ_n}}(W_i[x]) = M_{L_{τ_m}}(W_i[x])
and C_i, A_i ∉ OP_{L_{τ_m}}, the conditions for I.5 transitions are satisfied at τ_m and
we may define τ_{m+1} ∈ μ_I(τ_m) such that L_{τ_{m+1}} = L_{τ_m} ∘ C_i. Let β = τ_1τ_2...τ_{m+1}.
αβ is clearly μ_I-compatible and since L_{τ_m} = L_{σ_n}, L_{τ_{m+1}} = L_{σ_n} ∘ C_i. Thus μ_I
satisfies WFA2 in this case, as well.


3.3  Algorithm II (undo but no redo)

ALGORITHM 3.5. σ' ∈ μ_II(σ) iff one of the following is the case.

II.1: [Submit a read operation]

Conditions:

∃T_i ∃x [C_i, A_i, R_i[x] ∉ OP_{L_σ}]

Transformation rules:

S_{σ'} = S_σ
SWI_{σ'} = SWI_σ
L_{σ'} = L_σ ∘ R_i[x]


II.2: [Submit a write operation]

Conditions:

∃T_i ∃x [C_i, A_i, W_i[x] ∉ OP_{L_σ}]

Transformation rules:

S_{σ'} = S_σ
SWI_{σ'} = SWI_σ
L_{σ'} = L_σ ∘ W_i[x]


II.3: [Transfer the last committed value of a data item from the
"materialized database" to the "audit trail"]

Conditions:

∃T_i ∃x [T_i = lcw_σ(x)]

Transformation rules:

SWI_{σ'} = SWI_σ ∪ {(x, M_{L_σ}(W_i[x]), T_i)}
L_{σ'} = L_σ


II.4: [Overwrite the value of x in the "materialized database" with an
uncommitted value of x]

Conditions:

∃T_i ∃x [W_i[x] ∈ OP_{L_σ} ∧ C_i, A_i ∉ OP_{L_σ} ∧
[T_j = lcw_σ(x) ⇒ (W_j[x] <_{L_σ} W_i[x] ∧ (x, CS_σ(x), T_j) ∈ SWI_σ)]]

Transformation rules:

S_{σ'}(y) = S_σ(y)            if y ≠ x
S_{σ'}(y) = M_{L_σ}(W_i[x])   if y = x
SWI_{σ'} = SWI_σ
L_{σ'} = L_σ


II.5: [Commit a transaction]

Conditions:

∃T_i [C_i, A_i ∉ OP_{L_σ} ∧ ∀x [(W_i[x] ∈ OP_{L_σ} ∧ lcw_σ(x) = T_j ∧
W_j[x] <_{L_σ} W_i[x]) ⇒ S_σ(x) = M_{L_σ}(W_i[x])]]

Transformation rules:

S_{σ'} = S_σ
SWI_{σ'} = SWI_σ
L_{σ'} = L_σ ∘ C_i




II.6: [Abort a transaction]

Conditions:

∃T_i [C_i, A_i ∉ OP_{L_σ}]

Transformation rules:

S_{σ'} = S_σ
SWI_{σ'} = SWI_σ
L_{σ'} = L_σ ∘ A_i


II.7: [Restore value of a data item, updated by an aborted transaction, in
the "materialized database"]

Conditions:

∃T_i ∃x [A_i, W_i[x] ∈ OP_{L_σ} ∧ S_σ(x) = M_{L_σ}(W_i[x])]

Transformation rules:

S_{σ'}(y) = S_σ(y)    if y ≠ x
S_{σ'}(y) = CS_σ(x)   if y = x
SWI_{σ'} = SWI_σ
L_{σ'} = L_σ


II.8: [Delete unnecessary information from the "audit trail"]

Conditions:

∃T_i ∃x [T_i ≠ lcw_σ(x)]

Transformation rules:

S_{σ'} = S_σ
SWI_{σ'} = SWI_σ − {(x, v, T_i) | v is any value}
L_{σ'} = L_σ

□ 3.5
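The interplay of types II.3, II.4 and II.7 (save the last committed value, overwrite with an uncommitted one, restore after an abort) can be sketched as follows. This is a simplified illustration with hypothetical function names: CS and lcw are passed in as plain dictionaries, the ordering test W_j[x] < W_i[x] and the "not yet ended" test of II.4 are omitted, and `restore` reads the saved value back from SWI, which is an assumption about how CS_σ(x) would be obtained in practice.

```python
def save_committed_value(S, SWI, CS, lcw, x):
    """Type II.3: put (x, CS(x), lcw(x)) into the audit trail."""
    return S, SWI | {(x, CS[x], lcw[x])}


def overwrite(S, SWI, CS, lcw, x, value):
    """Type II.4, simplified: install an uncommitted value of x in S."""
    if (x, CS[x], lcw[x]) not in SWI:
        raise ValueError("last committed value of x is not in the audit trail")
    return {**S, x: value}, SWI


def restore(S, SWI, lcw, x):
    """Type II.7: put the last committed value of x back after an abort."""
    committed = next(v for (item, v, t) in SWI if item == x and t == lcw[x])
    return {**S, x: committed}, SWI
```

A short usage example: with `S = {"x": 0}` and an empty audit trail, `overwrite` is refused until `save_committed_value` has run; after that the overwrite succeeds and `restore` brings `x` back to 0. Because the committed value is saved before any overwrite, undo is always possible, and because II.5 requires S to hold the transaction's values before commit, no redo is needed.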


THEOREM 3.6. Algorithm μ_II is correct.

Proof. Let σ_0σ_1...σ_n be a μ_II-compatible history. We show, by
induction on k, that σ_k is resilient for 0 ≤ k ≤ n.

Basis: By Lemma 2.4, σ_0 is resilient.

Induction Step: Suppose σ_k is resilient for some 0 ≤ k < n. By
μ_II-compatibility of σ_0σ_1...σ_n we have that σ_{k+1} ∈ μ_II(σ_k). Consider each
transition type.

Types II.1-3, and II.6: In all these cases we have S_{σ_{k+1}} = S_{σ_k}, and
SWI_{σ_{k+1}} ⊇ SWI_{σ_k}. Also, since L_{σ_{k+1}} ⊇ L_{σ_k} and exactly the same
transactions are committed in the two logs (i.e., Com(L_{σ_{k+1}}) = Com(L_{σ_k})),
it follows that CS_{σ_{k+1}} = CS_{σ_k}. Thus, if (R1) or (R2) hold for x in σ_k
they also hold for x in σ_{k+1}. By induction hypothesis then, σ_{k+1}
is resilient.


Type II.4: Let x ∈ D be the data item such that S_{σ_{k+1}}(x) ≠ S_{σ_k}(x). Note
that for all y ∈ D − {x}, if (R1) or (R2) hold for y in σ_k it also holds
for y in σ_{k+1}. By inductive hypothesis then, it suffices to show that
σ_{k+1} is resilient for x. By the conditions on II.4-type transitions we
have that (x, CS_{σ_k}(x), lcw_{σ_k}(x)) ∈ SWI_{σ_k}. Since L_{σ_{k+1}} = L_{σ_k} we have
lcw_{σ_{k+1}} = lcw_{σ_k} and CS_{σ_{k+1}} = CS_{σ_k}. Also, by the II.4-type transformation
rules we have SWI_{σ_{k+1}} = SWI_{σ_k}. Therefore, (x, CS_{σ_{k+1}}(x), lcw_{σ_{k+1}}(x)) ∈ SWI_{σ_{k+1}}.
Thus (R2) holds for x in σ_{k+1}.


Type II.5: Let T_i be the transaction that committed in the transition
from σ_k to σ_{k+1} (i.e. T_i is such that L_{σ_{k+1}} = L_{σ_k} ∘ C_i). Consider
any x ∈ D. There are two cases.

Case 1: T_i = lcw_{σ_{k+1}}(x). Then W_i[x] ∈ OP_{L_{σ_k}} and by the II.5-type
conditions, S_{σ_k}(x) = M_{L_{σ_k}}(W_i[x]). Since S_{σ_{k+1}} = S_{σ_k}, and because
M_{L_{σ_{k+1}}}(W_i[x]) = M_{L_{σ_k}}(W_i[x]) (as L_{σ_k} is a prefix of L_{σ_{k+1}}), it follows
that S_{σ_{k+1}}(x) = M_{L_{σ_{k+1}}}(W_i[x]). Moreover, because T_i = lcw_{σ_{k+1}}(x),
CS_{σ_{k+1}}(x) = M_{L_{σ_{k+1}}}(W_i[x]). Thus S_{σ_{k+1}}(x) = CS_{σ_{k+1}}(x). That is, (R1)
holds for x in σ_{k+1}, in this case.

Case 2: T_i ≠ lcw_{σ_{k+1}}(x). It is then easy to see that lcw_{σ_{k+1}}(x) = lcw_{σ_k}(x)
and CS_{σ_{k+1}}(x) = CS_{σ_k}(x). Since, by the type II.5 transformation rules,
S_{σ_{k+1}} = S_{σ_k} and SWI_{σ_{k+1}} = SWI_{σ_k}, (R1) or (R2) holds for x in σ_{k+1},
if it holds for x in σ_k. By inductive hypothesis, then, σ_{k+1} is
resilient.


Type II.7: Let T_i, x satisfy the II.7 conditions. Since A_i ∈ OP_{L_{σ_k}},
clearly T_i ≠ lcw_{σ_k}(x). Since S_{σ_k}(x) = M_{L_{σ_k}}(W_i[x]) and T_i ≠ lcw_{σ_k}(x),
CS_{σ_k}(x) ≠ S_{σ_k}(x). That is, (R1) is not satisfied for x in σ_k. Because
σ_k is resilient, by inductive hypothesis, (R2) must be satisfied for x
in σ_k, i.e. (x, CS_{σ_k}(x), lcw_{σ_k}(x)) ∈ SWI_{σ_k}. Since by the II.7 transformation
rules L_{σ_{k+1}} = L_{σ_k}, we have that lcw_{σ_{k+1}} = lcw_{σ_k} and CS_{σ_{k+1}} = CS_{σ_k}.
Also, by the same transformation rules, SWI_{σ_{k+1}} = SWI_{σ_k}. Thus
(x, CS_{σ_{k+1}}(x), lcw_{σ_{k+1}}(x)) ∈ SWI_{σ_{k+1}}, proving that (R2) holds for x in
σ_{k+1}, i.e. σ_{k+1} is resilient for x. For all y ∈ D − {x}, it is easy
to see that (R1) or (R2) holds for y in σ_{k+1}, if it does in σ_k. By
inductive hypothesis then σ_{k+1} is resilient for y ∈ D − {x}, as well.

Type II.8: Let T_i, x satisfy the conditions of II.8 transition type.
Clearly, for any y ∈ D − {x}, if (R1) or (R2) holds for y in σ_k it also
holds in σ_{k+1}, and by inductive hypothesis then, σ_{k+1} is resilient for
such y. Also if (R1) is satisfied for x in σ_k it will be satisfied
in σ_{k+1}. The only case of concern then, is if (R2) is satisfied for x
in σ_k. Then (x, CS_{σ_k}(x), lcw_{σ_k}(x)) ∈ SWI_{σ_k}. By the II.8 conditions,
T_i ≠ lcw_{σ_k}(x), hence (x, CS_{σ_k}(x), lcw_{σ_k}(x)) ∈ SWI_{σ_{k+1}}. Also, because
L_{σ_{k+1}} = L_{σ_k}, lcw_{σ_{k+1}} = lcw_{σ_k} and CS_{σ_{k+1}} = CS_{σ_k}. Therefore,
(x, CS_{σ_{k+1}}(x), lcw_{σ_{k+1}}(x)) ∈ SWI_{σ_{k+1}} establishing that (R2) holds for x
in σ_{k+1}. So σ_{k+1} is resilient for x, too.

This concludes the Induction Step and thereby the proof.  □ 3.6

THEOREM 3.7. Algorithm μ_II never requires redo but may require
undo.


Proof. Let σ_0σ_1...σ_n be a μ_II-compatible history, and let T_i, x
be such that S_{σ_n}(x) = M_{L_{σ_n}}(W_i[x]) and C_i ∈ OP_{L_{σ_n}}. To prove that μ_II
never requires redo, it suffices to show that T_i = lcw_{σ_n}(x). Let k ≤ n
be such that S_{σ_{k−1}}(x) ≠ M_{L_{σ_{k−1}}}(W_i[x]) and S_{σ_r}(x) = M_{L_{σ_r}}(W_i[x]) for
k ≤ r ≤ n (3.7.1). (If no such k exists, then S_{σ_n}(x) = M_{L_{σ_n}}(W_0[x])
and T_0 = lcw_{σ_n}(x), and we are done.) By inspection of μ_II it is clear
that the transition from σ_{k−1} to σ_k can only be of type II.4. Let
T_j = lcw_{σ_{k−1}}(x) (3.7.2). By the conditions of II.4 transition type we
have that W_j[x] <_{L_{σ_{k−1}}} W_i[x] (3.7.3). Also, by the same conditions
we have that C_i ∉ OP_{L_{σ_{k−1}}} and hence C_i ∉ OP_{L_{σ_k}} (recall that the transition
from σ_{k−1} to σ_k is of type II.4, so T_i can't commit during that
transition). Because, by assumption, C_i ∈ OP_{L_{σ_n}}, we must have that for
some p, k < p ≤ n, L_{σ_p} = L_{σ_{p−1}} ∘ C_i (3.7.4).

Claim: For any q ≠ p, k < q ≤ n, lcw_{σ_q}(x) = lcw_{σ_{q−1}}(x).

Proof (of the Claim): By definition of lcw, lcw_{σ_q}(x) ≠ lcw_{σ_{q−1}}(x)
only if L_{σ_q} = L_{σ_{q−1}} ∘ C_s and W_s[x] ∈ OP_{L_{σ_q}}, i.e. only if in the transition
from σ_{q−1} to σ_q a transaction T_s, which contains a write operation on
x, commits. So we only need to consider such transitions which, as can
be easily seen, can only be of type II.5. Let lcw_{σ_{q−1}}(x) = T_r. By the type
II.5 conditions, we must have that if W_r[x] <_{L_{σ_{q−1}}} W_s[x] then
S_{σ_{q−1}}(x) = M_{L_{σ_{q−1}}}(W_s[x]). But since q ≠ p, T_s ≠ T_i and this could not be
the case because, by (3.7.1), S_{σ_r}(x) = M_{L_{σ_r}}(W_i[x]) for all k ≤ r ≤ n and
in particular for r = q−1 (recall that k < q ≤ n). Therefore, we must
have that W_s[x] <_{L_{σ_{q−1}}} W_r[x] and since L_{σ_q} ⊇ L_{σ_{q−1}}, W_s[x] <_{L_{σ_q}} W_r[x].
Hence lcw_{σ_q}(x) ≠ T_s and therefore lcw_{σ_q}(x) = T_r = lcw_{σ_{q−1}}(x).  □ Claim

Now, by using (3.7.2), the Claim, and induction on q for k ≤ q < p,
we get that lcw_{σ_{p−1}}(x) = T_j. By (3.7.4) C_i ∈ OP_{L_{σ_p}}, and by (3.7.3) (and
the fact that L_{σ_p} ⊇ L_{σ_{k−1}}) we have that lcw_{σ_p}(x) = T_i. By using this,
the Claim and induction on q for p < q ≤ n, we get that lcw_{σ_n}(x) = T_i,
as wanted.

To show that μ_II may require undo, consider the history σ_0σ_1σ_2σ_3
defined as follows:

(σ_0 is the initial system state)

S_{σ_1} = S_{σ_0},  SWI_{σ_1} = SWI_{σ_0},  L_{σ_1} = L_{σ_0} ∘ W_1[x]

S_{σ_2} = S_{σ_1},  SWI_{σ_2} = SWI_{σ_1} ∪ {(x, g_{0x}(), T_0)},  L_{σ_2} = L_{σ_1}

S_{σ_3}(y) = S_{σ_2}(y)   if y ≠ x
S_{σ_3}(y) = g_{1x}()     if y = x
SWI_{σ_3} = SWI_{σ_2},  L_{σ_3} = L_{σ_2}

It can be easily checked that this is a μ_II-compatible history (σ_0 goes
to σ_1 by means of a type II.2 transition, σ_1 to σ_2 by means of a
type II.3 transition, and σ_2 to σ_3 by means of a type II.4 transition)
and that S_{σ_3}(x) = g_{1x}() = M_{L_{σ_3}}(W_1[x]), while C_1 ∉ OP_{L_{σ_3}} (no type II.5
transitions). σ_0σ_1σ_2σ_3 shows that μ_II may require undo.  □ 3.7

THEOREM 3.8. Algorithm μ_II is well-formed.

Proof. That μ_II satisfies WFA1 follows directly from transition
types II.1, 2 and 6. We proceed to show that μ_II satisfies WFA2. Consider
any μ_II-compatible history α = σ_0σ_1...σ_n. Let T_i be any transaction
such that A_i, C_i ∉ OP_{L_{σ_n}}.

Case 1: If, moreover, for all W_i[x] ∈ OP_{L_{σ_n}} such that W_j[x] <_{L_{σ_n}} W_i[x],
where T_j = lcw_{σ_n}(x), it is the case that S_{σ_n}(x) = M_{L_{σ_n}}(W_i[x]), the II.5
conditions are satisfied and we may define τ_1 ∈ μ_II(σ_n) by a type II.5
transition. Taking β = τ_1, we have that αβ is μ_II-compatible and
L_{τ_1} = L_{σ_n} ∘ C_i, proving that WFA2 holds, in this case.

Case 2: Now suppose that there exists some x ∈ D such that W_j[x] <_{L_{σ_n}} W_i[x],
where T_j = lcw_{σ_n}(x), yet S_{σ_n}(x) ≠ M_{L_{σ_n}}(W_i[x]) (3.8.1). Let x_1, x_2, ..., x_m
be all such x's. Define history τ_0τ_1...τ_m as follows:
Let τ_0 = σ_n, and denote T_{j(x_k)} = lcw_{σ_n}(x_k), for 1 ≤ k ≤ m.

S_{τ_k} = S_{τ_{k−1}}
SWI_{τ_k} = SWI_{τ_{k−1}} ∪ {(x_k, M_{L_{τ_{k−1}}}(W_{j(x_k)}[x_k]), T_{j(x_k)})}
L_{τ_k} = L_{τ_{k−1}}

for 1 ≤ k ≤ m.

It is easy to see that the transition from τ_{k−1} to τ_k is of type II.3
and at each step the conditions are enabled. So the history σ_0σ_1...σ_nτ_1...τ_m
is μ_II-compatible. Also, it can be easily verified that (x_k, CS_{τ_m}(x_k),
T_{j(x_k)}) ∈ SWI_{τ_m} and that T_{j(x_k)} = lcw_{τ_m}(x_k), for 1 ≤ k ≤ m. By choice of
the x_k's, and the fact that L_{τ_m} = L_{σ_n} we get that W_{j(x_k)}[x_k] <_{L_{τ_m}} W_i[x_k],
1 ≤ k ≤ m (see (3.8.1)). Therefore in τ_m the conditions for II.4
transitions are enabled for T_i and the x_k's. So, we let

S_{τ_{m+k}}(y) = S_{τ_{m+k−1}}(y)             if y ≠ x_k
S_{τ_{m+k}}(y) = M_{L_{τ_{m+k−1}}}(W_i[x_k])   if y = x_k
L_{τ_{m+k}} = L_{τ_{m+k−1}}

for 1 ≤ k ≤ m.

It can be seen that σ_0σ_1...σ_nτ_1τ_2...τ_{2m} is μ_II-compatible, that
lcw_{τ_{2m}}(x_k) = T_{j(x_k)}, W_{j(x_k)}[x_k] <_{L_{τ_{2m}}} W_i[x_k] and S_{τ_{2m}}(x_k) = M_{L_{τ_{2m}}}(W_i[x_k])
for 1 ≤ k ≤ m. Since the x_k's were the only data items for which the
II.5 conditions were violated on σ_n and are no longer violated on τ_{2m},
we can define a type II.5 transition from τ_{2m} to τ_{2m+1}, so that

S_{τ_{2m+1}} = S_{τ_{2m}},
SWI_{τ_{2m+1}} = SWI_{τ_{2m}}, and
L_{τ_{2m+1}} = L_{τ_{2m}} ∘ C_i.

Let β = τ_1τ_2...τ_{2m+1}. We have proved that αβ is μ_II-compatible and
since L_{τ_{2m}} = L_{σ_n} we also have that L_{τ_{2m+1}} = L_{σ_n} ∘ C_i, finally proving
that WFA2 holds in this case, too.


3.4  Algorithm  III  (no  undo  but  redo) 

ALGORITHM 3.9. σ' ∈ μ_III(σ) iff one of the following is the case:

III.1: [Submit a read operation]

Conditions:

∃T_i ∃x [C_i, A_i, R_i[x] ∉ OP_{L_σ}]

Transformation rules:

S_{σ'} = S_σ
SWI_{σ'} = SWI_σ
L_{σ'} = L_σ ∘ R_i[x]

III.2: [Submit a write operation]

Conditions:

∃T_i ∃x [C_i, A_i, W_i[x] ∉ OP_{L_σ}]

Transformation rules:

S_{σ'} = S_σ
SWI_{σ'} = SWI_σ
L_{σ'} = L_σ ∘ W_i[x]


III.3: [Record an uncommitted update to "audit trail"; create "intention
list"*]

Conditions:

∃T_i ∃x [C_i, A_i ∉ OP_{L_σ} ∧ W_i[x] ∈ OP_{L_σ}]

Transformation rules:

S_{σ'} = S_σ
SWI_{σ'} = SWI_σ ∪ {(x, M_{L_σ}(W_i[x]), T_i)}
L_{σ'} = L_σ


III.4: [Commit a transaction after its "intention list" is in the audit trail]

Conditions:

∃T_i [C_i, A_i ∉ OP_{L_σ} ∧ ∀W_i[x] ∈ OP_{L_σ} [(x, M_{L_σ}(W_i[x]), T_i) ∈ SWI_σ]]

Transformation rules:

S_{σ'} = S_σ
SWI_{σ'} = SWI_σ
L_{σ'} = L_σ ∘ C_i

III.5: [Move updates of a committed transaction from the "audit trail"
to the "materialized database", one at a time, i.e. start carrying
out "intentions list"]

Conditions:

∃T_i ∃x [T_i = lcw_σ(x) ∧ (x, M_{L_σ}(W_i[x]), T_i) ∈ SWI_σ]

Transformation rules:

S_{σ'}(y) = S_σ(y)            if y ≠ x
S_{σ'}(y) = M_{L_σ}(W_i[x])   if y = x
SWI_{σ'} = SWI_σ

III.6: [Delete "intentions list" of transaction T_i once it has been
carried out or T_i aborted]

Conditions:

∃T_i [(C_i ∈ OP_{L_σ} ∧ ∀W_i[x] ∈ OP_{L_σ} [S_σ(x) = M_{L_σ}(W_i[x])]) ∨ A_i ∈ OP_{L_σ}]

Transformation rules:

SWI_{σ'} = SWI_σ − {(x, M_{L_σ}(W_i[x]), T_i) | x ∈ D}

*In the earliest description of a class III algorithm (that we know of)
[LS76] the data structure used in the audit trail is called the "intention
list". The term derives from the fact that a transaction creates a list of
the updates it "intends" to perform, writes this list in stable storage
(the "audit trail") and then commits; later the intentions list is carried
out and the updates recorded in the materialized database.


III.7: [Abort a transaction]

Conditions:

∃T_i [C_i, A_i ∉ OP_{L_σ}]

Transformation rules:

S_{σ'} = S_σ
SWI_{σ'} = SWI_σ
L_{σ'} = L_σ ∘ A_i

□ 3.9
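The intention-list discipline of types III.2 to III.5 can be sketched as a small Python class. This is a toy model under assumptions, not the report's formulation: the class and method names are hypothetical, the log is reduced to a per-transaction dictionary of writes plus a write counter, each transaction writes an item at most once, and lcw is taken to be the committed transaction whose write to the item came latest.

```python
class IntentionStore:
    """Toy model of Algorithm III; all names here are illustrative."""

    def __init__(self, initial):
        self.S = dict(initial)   # materialized database
        self.SWI = set()         # audit trail of (item, value, txn) triples
        self.writes = {}         # txn -> {item: value}, taken from the log
        self.order = {}          # (txn, item) -> position of the write
        self.lcw = {}            # item -> last committed writer
        self.ended = set()       # transactions that committed or aborted

    def write(self, txn, x, value):        # type III.2
        self.writes.setdefault(txn, {})[x] = value
        self.order[(txn, x)] = len(self.order)

    def record(self, txn, x):              # type III.3
        self.SWI.add((x, self.writes[txn][x], txn))

    def commit(self, txn):                 # type III.4
        todo = self.writes.get(txn, {})
        missing = any((x, v, txn) not in self.SWI for x, v in todo.items())
        if txn in self.ended or missing:
            raise ValueError("III.4 conditions do not hold")
        self.ended.add(txn)
        for x in todo:
            cur = self.lcw.get(x)
            if cur is None or self.order[(cur, x)] < self.order[(txn, x)]:
                self.lcw[x] = txn

    def carry_out(self, txn, x):           # type III.5
        v = self.writes[txn][x]
        if self.lcw.get(x) != txn or (x, v, txn) not in self.SWI:
            raise ValueError("III.5 conditions do not hold")
        self.S[x] = v
```

In this model an uncommitted value never reaches `S`, so nothing has to be undone, while a committed value may still be missing from `S` until `carry_out` runs, which is where redo may be needed.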


THEOREM 3.10. Algorithm μ_III is correct.

Proof. Let σ_0σ_1...σ_n be a μ_III-compatible history. We'll show,
by induction on k, that σ_k is resilient for 0 ≤ k ≤ n.

Basis: σ_0 is resilient, by Lemma 2.4.

Induction Step: Suppose σ_k is resilient for some 0 ≤ k < n. By
μ_III-compatibility of σ_0σ_1...σ_n we have that σ_{k+1} ∈ μ_III(σ_k). Consider the
type of transition from σ_k to σ_{k+1}.

Types III.1-3 and 7: In all these cases we have that lcw_{σ_{k+1}} = lcw_{σ_k},
CS_{σ_{k+1}} = CS_{σ_k}, S_{σ_{k+1}} = S_{σ_k} and SWI_{σ_{k+1}} ⊇ SWI_{σ_k}. Thus, if (R1) or (R2) holds
for some x ∈ D in σ_k it also holds for x in σ_{k+1}. By induction
hypothesis then σ_{k+1} is resilient in this case.

Type III.4: Let T_i be the transaction that commits in the transition
from σ_k to σ_{k+1}; i.e. L_{σ_{k+1}} = L_{σ_k} ∘ C_i. Consider any x ∈ D.

Case 1: lcw_{σ_{k+1}}(x) ≠ T_i. Then lcw_{σ_{k+1}}(x) = lcw_{σ_k}(x), CS_{σ_{k+1}}(x) = CS_{σ_k}(x)
and if (R1) or (R2) holds for x in σ_k it also holds for x in σ_{k+1}
(for (R1) note that S_{σ_{k+1}} = S_{σ_k} and for (R2) that SWI_{σ_{k+1}} = SWI_{σ_k},
by the III.4 transformation rules). By induction hypothesis then, σ_{k+1}
is resilient for x, in this case.

Case 2: lcw_{σ_{k+1}}(x) = T_i. Then there must exist some W_i[x] ∈ OP_{L_{σ_k}}
and by the III.4 conditions, (x, M_{L_{σ_k}}(W_i[x]), T_i) ∈ SWI_{σ_k}. Since
lcw_{σ_{k+1}}(x) = T_i, CS_{σ_{k+1}}(x) = M_{L_{σ_{k+1}}}(W_i[x]) = M_{L_{σ_k}}(W_i[x]). By the III.4
transformation rules, SWI_{σ_{k+1}} = SWI_{σ_k} and therefore, (x, CS_{σ_{k+1}}(x),
lcw_{σ_{k+1}}(x)) ∈ SWI_{σ_{k+1}} establishing that (R2) holds for x in σ_{k+1}. Thus
σ_{k+1} is resilient for x in this case, as well.

Type III.5: Let T_i, x be such that S_{σ_{k+1}}(x) = M_{L_{σ_k}}(W_i[x]) ≠ S_{σ_k}(x).
Consider any y ∈ D.

Case 1: y ≠ x. Then S_{σ_{k+1}}(y) = S_{σ_k}(y) and SWI_{σ_{k+1}} = SWI_{σ_k}, by the
III.5 transformation rules. Since also lcw_{σ_{k+1}} = lcw_{σ_k} and
CS_{σ_{k+1}} = CS_{σ_k}, it follows that if (R1) or (R2) holds for y in σ_k it
also holds for y in σ_{k+1}. By induction hypothesis then, σ_{k+1} is
resilient for y, in this case.

Case 2: y = x. Then, by the III.5 conditions, T_i = lcw_{σ_k}(x) and
(x, M_{L_{σ_k}}(W_i[x]), T_i) ∈ SWI_{σ_k}. Since lcw_{σ_{k+1}} = lcw_{σ_k}, CS_{σ_{k+1}}(x) = M_{L_{σ_{k+1}}}(W_i[x])
= M_{L_{σ_k}}(W_i[x]) and SWI_{σ_{k+1}} = SWI_{σ_k}, we have that (x, CS_{σ_{k+1}}(x),
lcw_{σ_{k+1}}(x)) ∈ SWI_{σ_{k+1}} establishing that (R2) holds for x in σ_{k+1}.
Thus σ_{k+1} is resilient for x = y in this case, also.


Type III.6: Let T_i be such that for all W_i[x] ∈ OP_{L_{σ_k}},
(x, M_{L_{σ_k}}(W_i[x]), T_i) ∈ SWI_{σ_k} − SWI_{σ_{k+1}}. It is clear that if (R1) holds for
any x ∈ D in σ_k it also holds for x in σ_{k+1}, since CS_{σ_{k+1}} = CS_{σ_k}
and S_{σ_{k+1}} = S_{σ_k}. So, suppose that (R2) holds for x in σ_k. That is,
(x, CS_{σ_k}(x), lcw_{σ_k}(x)) ∈ SWI_{σ_k}. If T_i ≠ lcw_{σ_k}(x) then (x, CS_{σ_k}(x),
lcw_{σ_k}(x)) ∈ SWI_{σ_{k+1}} and since CS_{σ_{k+1}} = CS_{σ_k} and lcw_{σ_{k+1}} = lcw_{σ_k}, (R2)
holds for x in σ_{k+1}, also. If T_i = lcw_{σ_k}(x), A_i ∉ OP_{L_{σ_k}} and by the
III.6 conditions, we must have that S_{σ_k}(x) = M_{L_{σ_k}}(W_i[x]) = CS_{σ_k}(x).
Because S_{σ_{k+1}} = S_{σ_k} and CS_{σ_{k+1}} = CS_{σ_k}, we get S_{σ_{k+1}}(x) = CS_{σ_{k+1}}(x)
and thus (R1) holds for x in σ_{k+1}. Therefore, in either case
(T_i = lcw_{σ_k}(x) or T_i ≠ lcw_{σ_k}(x)) σ_{k+1} is resilient for x.

This completes the induction step and thereby the proof.  □ 3.10


THEOREM 3.11. Algorithm $\mu_{III}$ never requires undo but may require redo.

Proof. Let $\sigma_0\sigma_1\ldots\sigma_n$ be any $\mu_{III}$-compatible history. Consider any $x \in D$ and let $S_{\sigma_n}(x) = M_{L_{\sigma_n}}(W_i[x])$. To prove that $\mu_{III}$ never requires undo, it suffices to show that $C_i \in OP_{L_{\sigma_n}}$. Indeed, let $j \le n$ be such that $S_{\sigma_{j-1}}(x) \ne M_{L_{\sigma_n}}(W_i[x])$ and for $j \le k \le n$, $S_{\sigma_k}(x) = M_{L_{\sigma_n}}(W_i[x])$. If no such $j$ exists, $S_{\sigma_n}(x) = M_{L_{\sigma_n}}(W_0[x])$ and $C_0 \in OP_{L_{\sigma_n}}$, and we are done. By inspection of the transition rules of $\mu_{III}$ we have that the transition from $\sigma_{j-1}$ to $\sigma_j$ can only be of type III.5 (it is the only type in which the S-component of the system state changes). By the III.5 conditions $T_i = lcw_{\sigma_{j-1}}(x)$ and thus $C_i \in OP_{L_{\sigma_{j-1}}}$. Surely then $C_i \in OP_{L_{\sigma_n}}$ (since $L_{\sigma_n} \supseteq L_{\sigma_{j-1}}$), as required.

To see that $\mu_{III}$ may require redo, consider the history $\sigma_0\sigma_1\sigma_2\sigma_3$ defined as follows ($\sigma_0$ is the initial system state):

$S_{\sigma_1} = S_{\sigma_0}$, $SWI_{\sigma_1} = SWI_{\sigma_0}$, $L_{\sigma_1} = L_{\sigma_0} \circ W_1[x]$

$S_{\sigma_2} = S_{\sigma_1}$, $SWI_{\sigma_2} = SWI_{\sigma_1} \cup \{(x, M_{L_{\sigma_1}}(W_1[x]), T_1)\}$, $L_{\sigma_2} = L_{\sigma_1}$

$S_{\sigma_3} = S_{\sigma_2}$, $SWI_{\sigma_3} = SWI_{\sigma_2}$, $L_{\sigma_3} = L_{\sigma_2} \circ C_1$

It can be readily verified that this is a $\mu_{III}$-compatible history ($\sigma_0$ goes to $\sigma_1$ by a III.2 transition, $\sigma_1$ to $\sigma_2$ by a III.3 transition and $\sigma_2$ to $\sigma_3$ by a III.4 transition), that $S_{\sigma_3}(x) = S_{\sigma_0}(x) = M_{L_{\sigma_3}}(W_0[x])$ and $C_0 \in OP_{L_{\sigma_3}}$, yet $lcw_{\sigma_3}(x) = T_1 \ne T_0$. Thus $\sigma_0\sigma_1\sigma_2\sigma_3$ shows that $\mu_{III}$ may require redo. □ 3.11
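The four-state history used in the redo argument can be replayed mechanically. The following is a minimal sketch, not the paper's formalism: it assumes the log is a totally ordered tuple rather than a partial order, and the names `State`, `lcw`, and `value_written` as well as the tuple encoding of operations are hypothetical choices made for illustration.

```python
from typing import NamedTuple

class State(NamedTuple):
    S: dict          # stable database state: item -> value
    SWI: frozenset   # stable write information: {(item, value, txn)}
    L: tuple         # log, simplified here to a totally ordered tuple

def lcw(log, x):
    """Last committed writer of x: the committed transaction whose
    write on x comes latest in the (assumed totally ordered) log."""
    committed = {op[1] for op in log if op[0] == "C"}
    writer = None
    for op in log:
        if op[0] == "W" and op[2] == x and op[1] in committed:
            writer = op[1]
    return writer

def value_written(log, txn, x):
    """Value carried by txn's last write on x (stands in for M_L(W_txn[x]))."""
    value = None
    for op in log:
        if op[0] == "W" and op[1] == txn and op[2] == x:
            value = op[3]
    return value

# sigma_0: T_0 has written the initial value of x and committed.
s0 = State({"x": "v0"}, frozenset(), (("W", 0, "x", "v0"), ("C", 0)))
# III.2 transition: T_1 submits a write on x (only the log grows).
s1 = s0._replace(L=s0.L + (("W", 1, "x", "v1"),))
# III.3 transition: the write is recorded in the stable write information.
s2 = s1._replace(SWI=s1.SWI | {("x", "v1", 1)})
# III.4 transition: T_1 commits; S is left untouched.
s3 = s2._replace(L=s2.L + (("C", 1),))

writer = lcw(s3.L, "x")
# The stable state still holds T_0's value although T_1 is the last committed writer.
needs_redo = s3.S["x"] != value_written(s3.L, writer, "x")
```

In the final state the stable database still holds the value written by $T_0$ while the last committed writer is $T_1$, which is the situation the proof describes as requiring redo; the value needed for the redo is available in SWI.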


THEOREM 3.12. Algorithm $\mu_{III}$ is well-formed.

Proof. By transition types III.1, 2 and 7 it is immediate that $\mu_{III}$ satisfies WFA1. We show that it also satisfies WFA2. Let $\alpha = \sigma_0\sigma_1\ldots\sigma_n$ be any $\mu_{III}$-compatible history. Consider any $T_i$ such that $C_i, A_i \notin OP_{L_{\sigma_n}}$.

Case 1: If for all $W_i[x] \in OP_{L_{\sigma_n}}$, $(x, M_{L_{\sigma_n}}(W_i[x]), T_i) \in SWI_{\sigma_n}$, the III.4 conditions are satisfied and we may define $\tau_1 \in \mu_{III}(\sigma_n)$ such that $L_{\tau_1} = L_{\sigma_n} \circ C_i$. Since $\sigma_0\sigma_1\ldots\sigma_n\tau_1$ is $\mu_{III}$-compatible we have that $\mu_{III}$ satisfies WFA2, in this case.

Case 2: Suppose for some $W_i[x] \in OP_{L_{\sigma_n}}$, $(x, M_{L_{\sigma_n}}(W_i[x]), T_i) \notin SWI_{\sigma_n}$ (3.12.1). Let $x_1, \ldots, x_m$ be all such $x \in D$. Since $C_i, A_i \notin OP_{L_{\sigma_n}}$ and $W_i[x_k] \in OP_{L_{\sigma_n}}$, the III.3 conditions are enabled for $T_i$ and any $x_k$, $1 \le k \le m$, so we may define $\tau_k$, $0 \le k \le m$, as follows:

$\tau_0 = \sigma_n$, and

$S_{\tau_{k+1}} = S_{\tau_k}$,

$SWI_{\tau_{k+1}} = SWI_{\tau_k} \cup \{(x_{k+1}, M_{L_{\sigma_n}}(W_i[x_{k+1}]), T_i)\}$, and

$L_{\tau_{k+1}} = L_{\tau_k}$,

for $0 \le k < m$.

Since $x_1, \ldots, x_m$ were all the data items $x \in D$ such that (3.12.1) is true and clearly $C_i, A_i \notin OP_{L_{\tau_m}}$ and $(x, M_{L_{\tau_m}}(W_i[x]), T_i) \in SWI_{\tau_m}$, the III.4 conditions are satisfied in $\tau_m$ and we may thus define $\tau_{m+1} \in \mu_{III}(\tau_m)$ by a III.4 transition. If $\beta = \tau_1\tau_2\ldots\tau_{m+1}$, we have that $\alpha\beta$ is $\mu_{III}$-compatible and $L_{\tau_{m+1}} = L_{\sigma_n} \circ C_i$. Thus $\mu_{III}$ satisfies WFA2 in this case also.


3.5 Algorithm IV (neither undo nor redo)


ALGORITHM 3.13. $\sigma' \in \mu_{IV}(\sigma)$ if one of the following is the case:


IV.1: [Submit a read operation]

Conditions:

$\exists T_i\, \exists x\, [C_i, A_i, R_i[x] \notin OP_{L_\sigma}]$

Transformation rules:

$S_{\sigma'} = S_\sigma$

$SWI_{\sigma'} = SWI_\sigma$

$L_{\sigma'} = L_\sigma \circ R_i[x]$


$SWI_{\sigma'} = SWI_\sigma$

$L_{\sigma'} = L_\sigma \circ C_i$


IV.5: [Delete unnecessary information from the "audit trail"]

Conditions:

$\exists T_i\, [C_i \in OP_{L_\sigma} \vee A_i \in OP_{L_\sigma}]$

Transformation rules:

$S_{\sigma'} = S_\sigma$

$SWI_{\sigma'} = SWI_\sigma - \{(x, v, T_i) \mid x \in D, v \in HU\}$

$L_{\sigma'} = L_\sigma$


IV.6: [Abort transaction $T_i$]

Conditions:

$\exists T_i\, [C_i \notin OP_{L_\sigma}]$

Transformation rules:

$S_{\sigma'} = S_\sigma$

$SWI_{\sigma'} = SWI_\sigma$

$L_{\sigma'} = L_\sigma \circ A_i$


THEOREM 3.14. Algorithm $\mu_{IV}$ is correct.

Proof. Let $\sigma_0\sigma_1\ldots\sigma_n$ be a $\mu_{IV}$-compatible history. We'll show, by induction, that $\sigma_i$ is resilient for $0 \le i \le n$. In fact, we'll show something even stronger, namely that each $\sigma_i$ satisfies condition (R1) for all $x \in D$.

Basis: By the proof of Lemma 2.4, $\sigma_0$ satisfies (R1).

Induction Step: Let $0 \le k < n$ and suppose $\sigma_k$ satisfies (R1). Since $\sigma_0\sigma_1\ldots\sigma_n$ is $\mu_{IV}$-compatible, we know $\sigma_{k+1} \in \mu_{IV}(\sigma_k)$. We'll show that $\sigma_{k+1}$ satisfies (R1) by considering each type of transition defined by $\mu_{IV}$.

Types IV.1-3, IV.5-6: In all these cases, $L_{\sigma_{k+1}} \supseteq L_{\sigma_k}$, and the same transactions are committed in both logs ($Com(L_{\sigma_{k+1}}) = Com(L_{\sigma_k})$). Therefore $CS_{\sigma_{k+1}} = CS_{\sigma_k}$. Also, in all these cases $S_{\sigma_{k+1}} = S_{\sigma_k}$. By inductive hypothesis, $S_{\sigma_k} = CS_{\sigma_k}$. Hence, $S_{\sigma_{k+1}} = CS_{\sigma_{k+1}}$ as wanted.

Type IV.4: Let $T_i$ be the transaction that was committed in the transition from $\sigma_k$ to $\sigma_{k+1}$; i.e. $T_i$ is such that $L_{\sigma_{k+1}} = L_{\sigma_k} \circ C_i$. Consider any $x \in D$. We want to show that $S_{\sigma_{k+1}}(x) = CS_{\sigma_{k+1}}(x)$. We distinguish two cases.

Case 1: $S_{\sigma_{k+1}}(x) \ne S_{\sigma_k}(x)$. By the transformation rules we have that $S_{\sigma_{k+1}}(x) = M_{L_{\sigma_k}}(W_i[x])$ (3.14.1). Also, we claim that $T_i = lcw_{\sigma_{k+1}}(x)$. For, take any $T_\ell$, $\ell \ne i$, such that $C_\ell, W_\ell[x] \in OP_{L_{\sigma_{k+1}}}$. Obviously, by the transformation rules, $C_\ell, W_\ell[x] \in OP_{L_{\sigma_k}}$. Let $T_j = lcw_{\sigma_k}(x)$. By definition of lcw, either $j = \ell$, or $W_\ell[x] <_{L_{\sigma_k}} W_j[x]$. Because $<_{L_{\sigma_{k+1}}}$ extends $<_{L_{\sigma_k}}$ ($L_{\sigma_k} \subseteq L_{\sigma_{k+1}}$) and by the transformation rules $W_j[x] <_{L_{\sigma_k}} W_i[x]$, we have, by transitivity of $<_{L_{\sigma_{k+1}}}$, that $W_\ell[x] <_{L_{\sigma_{k+1}}} W_i[x]$. Moreover, $C_i \in OP_{L_{\sigma_{k+1}}}$ and by definition of lcw, $lcw_{\sigma_{k+1}}(x) = T_i$. But then, $CS_{\sigma_{k+1}}(x) = M_{L_{\sigma_{k+1}}}(W_i[x])$ (3.14.2). From (3.14.1), (3.14.2) and the obvious fact that $M_{L_{\sigma_k}}(W_i[x]) = M_{L_{\sigma_{k+1}}}(W_i[x])$ we conclude that $S_{\sigma_{k+1}}(x) = CS_{\sigma_{k+1}}(x)$, as desired.

Case 2: $S_{\sigma_{k+1}}(x) = S_{\sigma_k}(x)$ (3.14.3). We claim that in this case $T_i \ne lcw_{\sigma_{k+1}}(x)$. This is because, by the transformation rules, this case will happen if either $W_i[x] \notin OP_{L_{\sigma_k}}$ and hence $W_i[x] \notin OP_{L_{\sigma_{k+1}}}$, or $W_i[x] <_{L_{\sigma_k}} W_j[x]$ and hence $W_i[x] <_{L_{\sigma_{k+1}}} W_j[x]$, where $T_j = lcw_{\sigma_k}(x)$. In both cases, clearly $T_i \ne lcw_{\sigma_{k+1}}(x)$. But then, since $L_{\sigma_{k+1}} \supseteq L_{\sigma_k}$, $lcw_{\sigma_{k+1}}(x) = lcw_{\sigma_k}(x)$ and $CS_{\sigma_{k+1}}(x) = CS_{\sigma_k}(x)$ (3.14.4). By induction hypothesis, $S_{\sigma_k}(x) = CS_{\sigma_k}(x)$. This with (3.14.3) and (3.14.4) yields the desired $S_{\sigma_{k+1}}(x) = CS_{\sigma_{k+1}}(x)$.

This concludes the induction step and thereby the proof. □


THEOREM 3.15. Algorithm $\mu_{IV}$ never requires redo or undo.

Proof. In 3.14 we actually proved that for all $\mu_{IV}$-compatible histories $\sigma_0\sigma_1\ldots\sigma_n$, for all $0 \le k \le n$ and all $x \in D$, $S_{\sigma_k}(x) = CS_{\sigma_k}(x)$. Definitions 2.6 and 2.7 then immediately yield the theorem. □
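The invariant $S_\sigma = CS_\sigma$ used in Theorems 3.14 and 3.15 can be checked on small examples. The sketch below is an illustration under stated assumptions, not the paper's Algorithm IV: the log is taken to be a totally ordered list, the commit step installs a transaction's writes into the stable state at commit time unless a committed transaction wrote the same item later in the log (the behaviour the two cases of the proof describe), and the function names `committed_state` and `commit` are hypothetical.

```python
def committed_state(log, initial):
    """CS: for each item, the value written by its last committed writer.
    `initial` plays the role of the values written by the initial transaction."""
    committed = {op[1] for op in log if op[0] == "C"}
    cs = dict(initial)
    for op in log:
        if op[0] == "W" and op[1] in committed:
            cs[op[2]] = op[3]
    return cs

def commit(S, log, txn):
    """Commit txn: install each of its writes into S unless a committed
    transaction has a later write on the same item, then append C_txn."""
    committed = {op[1] for op in log if op[0] == "C"}
    new_S = dict(S)
    for pos, op in enumerate(log):
        if op[0] == "W" and op[1] == txn:
            x = op[2]
            overwritten = any(
                later[0] == "W" and later[2] == x and later[1] in committed
                for later in log[pos + 1:]
            )
            if not overwritten:
                new_S[x] = op[3]
    return new_S, log + [("C", txn)]

initial = {"x": 0, "y": 0}
S = dict(initial)
log = [("W", 1, "x", 1), ("W", 2, "x", 2), ("W", 2, "y", 5)]

S, log = commit(S, log, 2)
after_first = (dict(S), committed_state(log, initial))
S, log = commit(S, log, 1)   # T_1's write on x is older than committed T_2's
after_second = (dict(S), committed_state(log, initial))
```

After each commit the stable state and the committed state coincide, so in this toy setting a failure at any point would call for neither redo nor undo.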

THEOREM 3.16. Algorithm $\mu_{IV}$ is well-formed.

Proof. WFA1 is immediate from transition types IV.1, 2 and 6. So we only need to show that $\mu_{IV}$ satisfies WFA2. Let $\alpha = \sigma_0\sigma_1\ldots\sigma_n$ be any $\mu_{IV}$-compatible history and consider any $T_i$ such that $C_i, A_i \notin OP_{L_{\sigma_n}}$.

Case 1: If it is also the case that for all $W_i[x] \in OP_{L_{\sigma_n}}$, $(x, M_{L_{\sigma_n}}(W_i[x]), T_i) \in SWI_{\sigma_n}$, the conditions of type IV.4 transitions are enabled and so we may define $\tau_1 \in \mu_{IV}(\sigma_n)$ such that $\tau_1$ is obtained from $\sigma_n$ by the transformation rules of type IV.4 transitions. By defining $\beta = \tau_1$, we get that $\alpha\beta$ is $\mu_{IV}$-compatible and $L_{\tau_1} = L_{\sigma_n} \circ C_i$, with $C_i, A_i \notin OP_{L_{\sigma_n}}$, verifying that $\mu_{IV}$ satisfies WFA2, in this case.


4. DISTRIBUTED DATABASE SYSTEMS


4.1  Introduction 

In  this  section  we  extend  our  model  to  describe  certain  aspects  of 
reliable  processing  of  operations  in  the  context  of  a  distributed  database 
system.  Speaking  in  broad  terms,  a  distributed  database  system  consists  of 
a  number  of  independent  processors,  called  sites,  each  of  which  stores  one 
or  more  data  items.  Transactions  wishing  to  access  the  data  items  submit 
operations  to  the  appropriate  sites  (usually  through  a  scheduler) .  These 
operations  are  processed  locally  (independently  of  what  happens  at  other 
sites) .  Eventually  each  transaction  must  be  terminated  by  either  becoming 
committed,  or  aborting.  This  decision  is  reached  by  consensus  of  the  sites 
at  which  the  transaction  submitted  operations:  if  all  these  sites  agree 
to  commit  it,  the  transaction  is  globally  committed;  if  even  one  site 
dissents,  the  transaction  must  be  globally  aborted. 

This consensus voting scheme is called atomic commitment and an algorithm that implements it is called an atomic commitment algorithm (protocol). Several such algorithms have been proposed, for example two-phase commit ([LS76], [G78]), three-phase commit ([S82]), and the SDD-1 (four-phase) commit algorithm ([HS80]), among others.

Atomic commitment is a coordination problem and, therefore, inherently involves communication among the participating (voting) sites. Because sites are independent processors (and may, therefore, fail independently) and because the medium of inter-site communication is also subject to a variety of failures, there are new kinds of fault tolerance that we might like a distributed database system to provide. Indeed, the various atomic commitment algorithms mentioned above have different properties of fault tolerance with respect to these new types of failures. Some interesting work on this question can be found in [S82].

Description  of  atomic  commitment  algorithms  and  proofs  of  their 
fault  tolerance  properties  is  beyond  the  scope  of  the  present  paper.  We 
therefore  deliberately  extend  our  model  in  a  way  that  factors  out  these 
questions.  (Other  extensions,  with  these  questions  in  mind,  are  possible 
and  will  be  sketched  in  the  next  section.) 

Rather, we are interested in understanding how to build reliable distributed algorithms for processing database operations out of reliable centralized algorithms of the kind described in Sections 2 and 3, under the assumptions that the communication medium does not fail and that failures of sites are detected by other sites. As we shall see, it is possible to build "heterogeneous" distributed algorithms, in the sense that different sites may use different centralized algorithms.

Our primary goal is the description of the conditions under which a site should vote to commit or abort a transaction in an atomic commitment protocol. This will be done by giving a predicate $VOTE(\sigma^i, T_k)$ where $\sigma^i$ is the (local) system state at site $i$ and $T_k$ is a transaction. Site $i$, if asked to vote on transaction $T_k$ while at state $\sigma^i$, votes to commit iff $VOTE(\sigma^i, T_k)$ is true.

4.2  Extension  of  the  Model 

A distributed database design is a triple $(t, D, str)$, where $t \in \mathbb{N}$ is the number of sites, $D$ is the set of data items and $str: D \to \{1, \ldots, t\}$ is a function that maps each data item to the site at which it is stored. Note that making $str$ a (single-valued) function commits us to databases with no data replication, as each data item in $D$ is stored at exactly one site. Throughout this section we keep the distributed database design fixed.

A global system state, $\sigma$, (for the given distributed database design) is a triple $\sigma = (S_\sigma, SWI_\sigma, L_\sigma)$, where the three components have the same meaning as the components of a system state as defined in Section 2. $\Sigma$ denotes the set of global system states.

A local system state at site $i$, $\sigma^i$, is a triple $\sigma^i = (S_{\sigma^i}, SWI_{\sigma^i}, L_{\sigma^i})$ where the three components are restricted to data items stored at site $i$. Thus, if we define $D^i = \{x \in D \mid str(x) = i\}$, then $S_{\sigma^i}: D^i \to HU$, $SWI_{\sigma^i} \subseteq D^i \times HU \times T$ and $L_{\sigma^i}$ is a log such that if $R_k[x] \in OP_{L_{\sigma^i}}$ or $W_k[x] \in OP_{L_{\sigma^i}}$, then $x \in D^i$. $\Sigma^i$ denotes the set of local system states at site $i$.

Given a global system state $\sigma \in \Sigma$, we may define a unique local system state $\sigma^i \in \Sigma^i$, for every site $i$, as follows: $S_{\sigma^i} = S_\sigma|_{D^i}$, $SWI_{\sigma^i} = \{(x, v, T_k) \in SWI_\sigma \mid x \in D^i\}$, and $L_{\sigma^i}$ is the restriction of the partial order $L_\sigma$ to the domain $OP_{L_{\sigma^i}} = \{a \in OP_{L_\sigma} \mid a \in \{C_k, A_k, R_k[x], W_k[x] \mid x \in D^i\}\}$.

Given a vector of system states $(\sigma^1, \ldots, \sigma^t)$ where $\sigma^i \in \Sigma^i$, we may define a global system state $\sigma$ as follows: $S_\sigma = \bigcup_{i=1}^{t} S_{\sigma^i}$, $SWI_\sigma = \bigcup_{i=1}^{t} SWI_{\sigma^i}$ and $L_\sigma$ is any log with domain $OP_{L_\sigma} = \bigcup_{i=1}^{t} OP_{L_{\sigma^i}}$ such that $L_\sigma \supseteq L_{\sigma^i}$, for every $1 \le i \le t$. Note that this does not uniquely define $\sigma$, as there may be several logs with the wanted property.
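The projection of a global system state onto a site can be sketched in a few lines. This is an illustration under simplifying assumptions: the log is a totally ordered list instead of a partial order, operations are encoded as tuples, and the names `local_state` and `str_map` are hypothetical.

```python
def local_state(S, SWI, log, str_map, site):
    """Restrict a global state (S, SWI, log) to the data items stored at `site`."""
    items_here = {x for x, s in str_map.items() if s == site}   # D^i
    S_i = {x: v for x, v in S.items() if x in items_here}
    SWI_i = {(x, v, t) for (x, v, t) in SWI if x in items_here}
    # Commits and aborts are kept at every site; reads and writes only
    # where the accessed item is stored.
    L_i = [op for op in log
           if op[0] in ("C", "A") or op[2] in items_here]
    return S_i, SWI_i, L_i

str_map = {"x": 1, "y": 2}             # str: item -> site, no replication
S = {"x": 10, "y": 20}
SWI = {("x", 11, 1)}
log = [("W", 1, "x"), ("R", 1, "y"), ("C", 1)]

site1 = local_state(S, SWI, log, str_map, 1)
site2 = local_state(S, SWI, log, str_map, 2)
```

Because `str_map` assigns each item to exactly one site, the union of the local S and SWI components gives back the global ones. The global log, by contrast, is not determined by the local logs alone, since the relative order of operations at different sites is lost in the projection.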


According to one school of distributed database researchers it is the job of the transaction manager (not of the scheduler or data manager) to ensure that, in replicated databases, all copies of a data item be kept consistent. This view is explicitly taken, for example, in [KP81]. As ours is essentially a model at the data manager level, this view would justify our assumption. We hope to explore the complications of data replication in another paper.

Throughout this section we adopt the convention that local system states $\sigma^i$ and global system state $\sigma$ are related as above.

A distributed algorithm (for processing database operations) is a pair $(\lambda_D, \gamma_D)$ where $\lambda_D: \Sigma \to 2^\Sigma$ and $\gamma_D: \Sigma \to 2^\Sigma$. Intuitively the distributed algorithm changes the global system state to a new one by one application of either $\lambda_D$ or $\gamma_D$ to the "current" system state. The reason we broke the state transition function into two components is that $\lambda_D$ corresponds to changes of the global state incurred by local processing at the sites, while $\gamma_D$ corresponds to changes of the global state incurred by the coordinated activity of aborting or committing a transaction; that is, $\gamma_D$ simulates (in a very crude way) the activity of atomic commitment.

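More formally, we require of a distributed algorithm $(\lambda_D, \gamma_D)$ that:

(DA1) If $\sigma' \in \lambda_D(\sigma)$ then for all transactions $T_k$, $C_k \in OP_{L_{\sigma'}} \Leftrightarrow C_k \in OP_{L_\sigma}$ and $A_k \in OP_{L_{\sigma'}} \Leftrightarrow A_k \in OP_{L_\sigma}$, and

(DA2) If $\sigma' \in \gamma_D(\sigma)$ then for some transaction $T_k$, either $L_{\sigma'} = L_\sigma \circ C_k$ or $L_{\sigma'} = L_\sigma \circ A_k$.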

A global history is a sequence $\sigma_1\sigma_2\ldots\sigma_n \in \Sigma^*$. A global history $\sigma_1\sigma_2\ldots\sigma_n$ is $(\lambda_D, \gamma_D)$-compatible iff $\sigma_{i+1} \in \lambda_D(\sigma_i)$ or $\sigma_{i+1} \in \gamma_D(\sigma_i)$ for $1 \le i < n$. For a log $L$ we define $h_D(L) = \{\alpha = \sigma_0\sigma_1\ldots\sigma_n \mid \alpha$ is $(\lambda_D, \gamma_D)$-compatible and $L_{\sigma_n} = L\}$. We may define resilient global system states, and the concepts of the last committed writer of a data item $x$ in a global system state $\sigma$, denoted $lcw_\sigma(x)$, and of the committed database state in a global system state $\sigma$, denoted $CS_\sigma$, exactly as in Section 2. We shall not duplicate these definitions here.^

^Note, however, that for these definitions to be immediately transferable to our new setting, it is crucial that there be no replicated data items. In fact, one concern in studying data replication will be to find conditions on the algorithms such that lcw and CS are always well-defined.

4.3 Building Reliable Distributed Algorithms

We are going to build reliable distributed algorithms on the basis of correct and well-formed centralized algorithms, as defined in Section 2 (and examples of which were analyzed in Section 3). Intuitively, each site uses a (centralized) algorithm to process the data operations submitted to it. Eventually transactions must become globally committed or globally aborted.

Let $\mu_{A_i}$ be the (centralized) algorithm on which the data processing algorithm of site $i$ is based. We assume that $\mu_{A_i}$ is correct (see Definition 2.5) and well-formed (see Definition 2.8).

Let $\sigma^i \in \Sigma^i$ and $T_k$ be a transaction. We define, for each site $i$, a predicate $VOTE_i$, as follows:

Definition 4.1. $VOTE_i(\sigma^i, T_k) \equiv \exists \sigma'^i \in \Sigma^i\, [\sigma'^i \in \mu_{A_i}(\sigma^i) \wedge L_{\sigma'^i} = L_{\sigma^i} \circ C_k]$. □ 4.1

Note that, by the fact that $\mu_{A_i}$ is well-formed, for any transaction $T_k$ there exist states $\sigma^i$ that render $VOTE_i(\sigma^i, T_k)$ true.

The predicate $VOTE_i$ is used by site $i$ to decide how to vote on a transaction $T_k$ if it is asked to participate in an atomic commitment protocol. If site $i$ is at state $\sigma^i$ when that happens, it votes to commit $T_k$ if $VOTE_i(\sigma^i, T_k)$ is true, and to abort $T_k$ otherwise. Intuitively, $VOTE_i(\sigma^i, T_k)$ is true if the centralized algorithm $\mu_{A_i}$ could commit $T_k$ in the next state transition from $\sigma^i$.
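Read operationally, the predicate $VOTE_i$ asks whether some successor state offered by the site's centralized algorithm extends the local log by a commit of the transaction. The sketch below illustrates this under assumptions that are not part of the paper: a state is a plain dictionary, the log is a totally ordered list, and `toy_mu` is a made-up centralized algorithm that can commit a transaction only when it is marked in a hypothetical `stable` set.

```python
def vote(mu, state, txn):
    """VOTE: true iff some successor of `state` under the centralized
    algorithm `mu` has the log of `state` extended by C_txn."""
    target = state["log"] + [("C", txn)]
    return any(succ["log"] == target for succ in mu(state))

def toy_mu(state):
    """Hypothetical centralized algorithm: every active transaction may
    abort; it may commit only if it appears in state['stable']."""
    log, stable = state["log"], state["stable"]
    finished = {op[1] for op in log if op[0] in ("C", "A")}
    active = {op[1] for op in log} - finished
    successors = []
    for t in sorted(active):
        successors.append({"log": log + [("A", t)], "stable": stable})
        if t in stable:
            successors.append({"log": log + [("C", t)], "stable": stable})
    return successors

ready = {"log": [("W", 1, "x")], "stable": {1}}
not_ready = {"log": [("W", 1, "x")], "stable": set()}

vote_ready = vote(toy_mu, ready, 1)
vote_not_ready = vote(toy_mu, not_ready, 1)
```

Passing `mu` as a parameter mirrors the heterogeneity noted in Section 4.1: each site may supply a different centralized algorithm while the voting rule stays the same.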

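We may now define functions $\lambda_D: \Sigma \to 2^\Sigma$ and $\gamma_D: \Sigma \to 2^\Sigma$ as follows:

Definition 4.2. $\sigma' \in \lambda_D(\sigma)$ iff

$[\sigma'^i \in \mu_{A_i}(\sigma^i) \wedge \forall T_k\, ((C_k \in OP_{L_{\sigma'}} \Leftrightarrow C_k \in OP_{L_\sigma}) \wedge (A_k \in OP_{L_{\sigma'}} \Leftrightarrow A_k \in OP_{L_\sigma}))] \vee \sigma'^i = \sigma^i$, for $1 \le i \le t$.

$\sigma' \in \gamma_D(\sigma)$ iff

$\sigma'^i \in \mu_{A_i}(\sigma^i) \wedge \exists T_k\, [(\bigwedge_{i=1}^{t} VOTE_i(\sigma^i, T_k) \wedge L_{\sigma'} = L_\sigma \circ C_k) \vee (\neg \bigwedge_{i=1}^{t} VOTE_i(\sigma^i, T_k) \wedge L_{\sigma'} = L_\sigma \circ A_k)]$, for all $1 \le i \le t$.

□ 4.2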

Informally, this definition says that a global system state transition by $\lambda_D$ occurs if one or more sites change their local system state due to local processing. A transition by $\gamma_D$ occurs when a transaction is globally committed or aborted. If every site $i$ is at a local system state $\sigma^i$ at which it can commit transaction $T_k$ in the next step by its local (centralized) algorithm (i.e., $VOTE_i(\sigma^i, T_k)$ is true for $1 \le i \le t$), then $T_k$ is globally committed. Otherwise $T_k$ is globally aborted.
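The global commit-or-abort rule can be pictured as a unanimity check over the sites' votes. The following sketch is only an illustration of that rule: the global log is simplified to a list, and the function `gamma_step` and the two vote predicates `writes_stable` and `always_ready` are hypothetical stand-ins for sites running different centralized algorithms.

```python
def gamma_step(global_log, sites, txn):
    """Append C_txn to the global log if every site votes to commit,
    otherwise append A_txn. `sites` is a list of (vote_predicate, local_state)."""
    votes = [vote(local_state, txn) for vote, local_state in sites]
    outcome = "C" if all(votes) else "A"
    return global_log + [(outcome, txn)]

# Two hypothetical vote predicates, standing for heterogeneous sites.
def writes_stable(local_state, txn):
    return txn in local_state["stable"]

def always_ready(local_state, txn):
    return True

log = [("W", 1, "x"), ("W", 1, "y")]

all_agree = gamma_step(log, [(writes_stable, {"stable": {1}}),
                             (always_ready, {})], 1)
one_dissents = gamma_step(log, [(writes_stable, {"stable": set()}),
                                (always_ready, {})], 1)
```

A single dissenting site is enough to turn the outcome into a global abort, which matches the consensus rule stated in Section 4.1.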

Now we must show that a distributed algorithm $(\lambda_D, \gamma_D)$, constructed as above, is correct. Let $\sigma_0 \in \Sigma$ be the initial global system state and $\sigma_0^i \in \Sigma^i$ be the initial local system state at site $i$. $\sigma_0$ and $\sigma_0^i$ are resilient (see Lemma 2.4).

THEOREM 4.3. Let $(\lambda_D, \gamma_D)$ be constructed as in Definition 4.2 and let $\sigma_0\sigma_1\ldots\sigma_n$ be a $(\lambda_D, \gamma_D)$-compatible history. Then $\sigma_n$ is resilient.

Proof. We show, by induction on $i$, that $\sigma_i$ is resilient, $0 \le i \le n$.

Basis: As we remarked, $\sigma_0$ is resilient.

Induction Step: Suppose $\sigma_j$ is resilient for some $0 \le j < n$. We'll show that $\sigma_{j+1}$ is also resilient. By inspection of Definition 4.2 it is easy to see that $lcw_{\sigma_j} = lcw_{\sigma_{j+1}}$, $CS_{\sigma_j} = CS_{\sigma_{j+1}}$ and $SWI_{\sigma_j} = SWI_{\sigma_{j+1}}$, except in case $\sigma_{j+1} \in \gamma_D(\sigma_j)$ and $L_{\sigma_{j+1}} = L_{\sigma_j} \circ C_k$ for some transaction $T_k$. Except for this case then, $\sigma_{j+1}$ is resilient by inductive hypothesis.

Let's now examine the remaining case. By 4.2, we must have that $VOTE_i(\sigma_j^i, T_k)$ is true for every $1 \le i \le t$. Thus, by Definition 4.2 there exist $\sigma_{j+1}^i \in \mu_{A_i}(\sigma_j^i)$ such that $L_{\sigma_{j+1}^i} = L_{\sigma_j^i} \circ C_k$ for $1 \le i \le t$. By assumption the $\mu_{A_i}$'s are correct and since by inductive hypothesis $\sigma_j$, and hence the $\sigma_j^i$'s, are resilient, it follows that the $\sigma_{j+1}^i$'s are resilient for $1 \le i \le t$. Hence $\sigma_{j+1}$ is resilient. □

We can also prove the analogue of Theorem 2.13 for a distributed database. We shall state the theorem in this new setting but will not give the proof, as it is similar to that of 2.13.

THEOREM 4.4. Let $(\lambda_D, \gamma_D)$ be a distributed algorithm defined as in

4.2 y  L €  RC,  0-0-...0  €hfl(L)  and  O.T....T  €!»«(*_  ...(D).  Then 

uj.ni/  u  i  m  u  com  ilj 


5. FINAL REMARKS


In  this  paper  we  developed  a  model  of  a  database  system  for  the  purpose 
of  studying  formally  the  reliable  processing  of  operations  in  such  a  system. 
Our  motivation  for  doing  so  is  that  in  most  of  the  work  on  database 
reliability  we  are  familiar  with,  notions  such  as  "correct  algorithm", 
"resilient  database",  "transaction  redo",  "transaction  undo"  etc.  are  used 
loosely  and  defined  in  the  limited  context  of  particular  database  systems 
under  development. 

We  were  able  to  use  our  model  to  define  such  concepts  formally  in  a 
way  that,  we  hope,  makes  intuitive  sense.  We  believe  that  our  model 
provides  a  reasonable  abstraction  in  which  algorithms  for  processing 
database  operations  can  be  described  and  their  reliability  properties 
studied.  Partial  evidence  supporting  our  belief  can  be  found  in  Section  3 
of  the  present  paper,  where  we  state  four  fundamental  algorithms  and 
rigorously  prove  their  reliability-related  properties. 

Lest we be mistaken for proposing our model as the database reliability panacea, let us point out that it is ineffective, if not misleading, for analyzing the performance of the algorithms that can be described in its terms. For example, from the redo/undo properties of the four algorithms described in Section 3 one might fallaciously conclude that algorithm IV performs better than either algorithm II or algorithm III, since algorithm IV never requires either redo or undo while the other two require one of them. By the same token one might also fallaciously conclude that algorithms II and III are both superior to algorithm I, which requires both redo and undo.


The reason such conclusions are unwarranted is that the data structures used in known implementations of algorithm I ([Lo77], [V78], [HR79], [BGH82]) may incur significant overhead. For example, the one implementation that is known to achieve the properties of Lorie's algorithm tends to cause physical separation of logically related disk pages, thus adversely affecting disk seek time. Our point is that models much more detailed than the one developed in this paper seem to be necessary before we are able to substantiate claims concerning the performance properties of different algorithms for processing database operations.

One direction in which our model can be extended is toward replicated distributed databases. One would hope to be able to define the properties that distributed algorithms for processing transactions must satisfy in order for the different copies of each data item to be consistent. One would also hope to relate such properties to log-theoretic properties of data replication, as in [ABG82] and [BG82].

Another promising direction is to extend our model for the purposes of studying atomic commitment algorithms. The idea is to equip local system states with two more components: one for messages to be delivered to other sites and another for messages received. There will be "local transitions" which can modify the database state, S, the stable write information, SWI, and the log, L, components of a local system state; and "global (or network) transitions" which "deliver" messages from one site to another by modifying the two new components. We must then define various algorithms as transition functions of this kind and also different classes of faults. One would hope that this will lead to rigorous proofs of resiliency properties of various atomic commitment algorithms (described as state transition functions) with respect to the different classes of faults. At the time of this writing, however, we have not yet experimented with this idea enough to have concluded whether or not it is a better way of studying atomic commitment protocols than other formalisms, such as the finite state automata-based formalism of Skeen [S82].


6. REFERENCES


[ABG82] R. Attar, P.A. Bernstein, and N. Goodman, "Site initialization, recovery and backup in a distributed database system," Proceedings of the 1982 Berkeley Workshop on Distributed Databases and Computer Networks, 1982 (also issued as Aiken Computation Laboratory TR-13-81, Harvard University).

[BG82] P.A. Bernstein and N. Goodman, "Multiversion concurrency control - theory and algorithms," Proceedings of the ACM SIGACT-SIGOPS Conference on Principles of Distributed Computation, August 1982, Ottawa (also issued as Aiken Computation Laboratory TR-20-82, Harvard University).

[BGH82] P.A. Bernstein, N. Goodman, and V. Hadzilacos, "Recovery algorithms for database systems," to appear, Proceedings of 1983 IFIP Conference.

[D82] D.J. Dubourdieu, "Implementation of distributed transactions," Proceedings of the 1982 Berkeley Workshop on Distributed Data Management and Computer Networks, 1982, pp. 81-94.

[G78] J.N. Gray, "Notes on database operating systems," in Operating Systems: An Advanced Course, Springer-Verlag, 1978.

[H82] V. Hadzilacos, "Formalizing recovery in database systems," unpublished memorandum, Aiken Computation Laboratory, Harvard University, May 1982.

[HR79] T. Härder and A. Reuter, "Optimization of logging and recovery in a database system," in Database Architecture, Bracchi and Nijssen (eds.), North-Holland, 1979, pp. 151-168.

[HR82] T. Härder and A. Reuter, "Principles of transaction oriented database recovery - a taxonomy," Universität Kaiserslautern, FRG, Technical Report 50/82, April 1982.

[HS80] M. Hammer and D.W. Shipman, "Reliability mechanism for SDD-1: A system for distributed databases," ACM Transactions on Database Systems 5:4, December 1980, pp. 431-466.

[KP81] P.C. Kanellakis and C.H. Papadimitriou, "The complexity of distributed concurrency control," Proceedings of the IEEE 22nd Annual Symposium on Foundations of Computer Science, October 1981, pp. 185-197.

[Li79] B.G. Lindsay et al., "Notes on distributed databases," IBM Research Report, No. RJ2571, July 1979.

[Lo77] R.A. Lorie, "Physical integrity in a large segmented database," ACM Transactions on Database Systems 2:1, March 1977, pp. 91-104.

[LS76] B. Lampson and H. Sturgis, "Crash recovery in a distributed data storage system," unpublished memorandum, Xerox PARC, 1976.

[Ly82] N.A. Lynch, "Concurrency control for resilient nested transactions," unpublished manuscript, Laboratory for Computer Science, MIT, May 1982.

[ML80] D.A. Menasce and O.E. Landes, "On the design of a reliable storage component for distributed database management systems," Proceedings of the Conference on Very Large Databases (VLDB), Montreal, 1980, pp. 365-375.

[R75] R.L. Rappaport, "File structure design to facilitate on-line instantaneous updating," Proceedings of the 1975 SIGMOD Conference, pp. 1-14.

[S81] D. Skeen, "Nonblocking commit protocols," Proceedings of the 1981 ACM-SIGMOD International Conference on Management of Data, April 1981, pp. 133-142.

[S82] D. Skeen, "Crash recovery in a distributed data base system," Ph.D. Dissertation, Department of Electrical Engineering and Computer Science, University of California at Berkeley, 1982.

[V78] J.S.M. Verhofstad, "Recovery techniques for database systems," ACM Computing Surveys 10:2, June 1978, pp. 167-195.


SECTION  VIII 


SITE  INITIALIZATION,  RECOVERY  AND  BACK-UP 
IN  A  DISTRIBUTED  DATABASE  SYSTEM 

Rony Attar
Philip A. Bernstein
Nathan Goodman


Published in the Proceedings of the Sixth Berkeley Workshop on Distributed Data Management and Computer Networks, Asilomar, Feb. 1982.


ABSTRACT 


Site initialization is the problem of integrating a new site into a running distributed database system (DDBS). Site recovery is the problem of integrating an old site into a DDBS when the site recovers from failure. Site backup is the problem of creating a static backup copy of a database for archival or query purposes. We present an algorithm that solves the site initialization problem. By modifying the algorithm slightly, we get solutions to the other two problems as well.

Our algorithm exploits the fact that a correct DDBS must run a serializable concurrency control algorithm. Our algorithm relies on the concurrency control algorithm to handle all inter-site synchronization.

1. THE SITE INITIALIZATION PROBLEM

Site initialization is the problem of integrating a new site into a distributed database system (DDBS). The goal is to make the new site look like all other sites. In particular, transactions must be able to access data at the new site in the same way as they access data at all other sites.

The main problem is to bring the database at the new site up-to-date relative to the rest of the system. The problem is caused by replicated data: if the new site stores datum X and there are copies of X elsewhere in the system, then the value of X at the new site must agree with its value in the rest of the system. There is a simple brute force solution to the problem: just turn off the DDBS, wait for all activity to subside, and then load the new site's database in bulk. Our solution is almost as simple as this, but far more practical.

Our algorithm exploits the fact that a correct DDBS must run a serializable concurrency control algorithm (cf. [BG]). Concurrency control is the activity of coordinating transactions that access a database concurrently. The goal is to prevent concurrent transactions from interfering with each other. This goal is usually formalized by the concept of serializability (e.g. [BSW, EGLT, Pa, SLR, Th]): an execution of transactions is serializable if it is equivalent to an execution in which transactions execute serially, one after the other with no concurrency. Many algorithms are known for attaining this goal, e.g. two-phase locking and timestamp ordering.

As we will see, the site initialization problem can be neatly framed in
terms of serializable executions. Once stated in these terms, a simple
solution will stare us in the face. All we have to do is:

(1) turn on the concurrency control algorithm at the new site;

(2) tell all other sites to begin updating the replicated data at
    the new site; and

(3) not let any transaction read a datum X at the new site until X
    has been updated at least once.

These  three  steps  are  a  sketch  of  our  algorithm.  The  rest  of  the  paper 
fills in the details, and explains why the algorithm works. We also show
how  to  use  the  algorithm  to  solve  site  recovery  and  backup  problems. 


2.  BASIC  CONCEPTS 

A  distributed  database  system  (DDBS)  is  a  set  of  sites  interconnected  by 
a  network.  Each  site  runs  two  software  modules:  a  transaction  manager  (TM) , 
which  supervises  the  execution  of  transactions;  and  a  data  manager  (DM) ,  which 
processes  read  and  write  operations  on  data  stored  at  the  site. 

A logical database is a set of logical data items, denoted X, Y, Z. A copy
of a logical data item stored at a site is called a physical data item. We
use $x_1, x_2, \ldots$ to denote the physical copies of X.

A  transaction  is  a  program  that  accesses  the  database  by  issuing  READ  and 
WRITE  operations  on  logical  data  items.  For  notational  convenience  we  assume 
that a transaction issues all of its READs before any of its WRITEs.

Each transaction's execution is supervised by one TM. When a transaction
issues an operation READ(X), its TM selects a copy of X, say $x_i$, and issues an
operation read($x_i$) to the DM that manages $x_i$. (We use upper case for logical
operations and lower case for physical ones.) When a transaction issues an
operation WRITE(X), its TM issues an operation write($x_i$) for every copy $x_i$
of X.

The set of logical data items that a transaction reads (resp. writes) is called the
transaction's readset (resp. writeset).

We mathematically model executions of transactions in a DDBS by a log.
A log describes the order in which read and write operations are processed by
DM's. Formally, a log is a partial order* of read and write operations.
For example,

$$
L_1 \;=\; w_0[x_1,x_2,y_1,z_1] \;\rightarrow\;
\begin{cases}
r_2[x_1] \rightarrow w_2[x_1,x_2] \rightarrow r_3[x_1] \rightarrow w_3[z_1] \\
r_1[y_1] \rightarrow r_1[x_2] \rightarrow w_1[x_1,x_2,y_1]
\end{cases}
$$

is a log. Notationally, $r_i[x_j]$ (resp. $w_i[x_j]$) denotes the execution of a
read (resp. write) operation by transaction i on data item $x_j$. The
arrows indicate the partial order, which represents the order in which
operations were executed. So, in $L_1$, $w_0[x_1,x_2,y_1,z_1]$ executed before any other
operation; $r_2[x_1]$ executed before $w_2[x_1,x_2]$ and $r_1[x_2]$, but it executed
concurrently with $r_1[y_1]$, and so forth.

We place one more constraint on the allowed form of logs: for each
physical data item $x_i$, all operations on $x_i$ must be totally ordered.**
That is, for each $x_i$, we know the exact order in which operations on $x_i$
occurred. We often relax this constraint for read operations, since the order
of reads is unimportant anyway.


*A partial order is a binary relation, $\le$, that is reflexive ($a \le a$),
antisymmetric ($a \le b$ and $b \le a$ implies $a = b$), and transitive ($a \le b$ and
$b \le c$ implies $a \le c$). Traditionally, a distributed execution is modelled
as a set of sequential logs, one per DM [BG]. We prefer using partial
orders because they allow operations from different DM's to be ordered and
they do not require ordering unrelated operations that can be executed
concurrently at the same DM.

**A total order is a partial order in which every pair of elements is related
(i.e., $a \le b$ or $b \le a$). A total order is the same as a sequence.

Two logs are equivalent if they represent executions that produce
the  same  final  database  state,  and  if  each  transaction  performs  the  same 
computation  in  both  executions.  The  following  proposition  states  a 
well-known,  and  very  useful,  characterization  of  log  equivalence.  We  need 
one  more  definition  first.  Two  operations  conflict  if  they  operate  on  the 
same  physical  data  item  and  one  is  a  write. 

Proposition  1  Two  logs  are  equivalent  if  they  contain  the  same  operations, 
and  every  pair  of  conflicting  operations  appear  in  the  same 
order  in  both  logs. 
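
The test in Proposition 1 can be sketched mechanically. The following Python fragment is an illustration only, under assumptions that are not in the text: each log is simplified to a sequence (a total order) of distinct operations, an operation is a tuple (kind, transaction, physical item), and the function names are hypothetical.

```python
from itertools import combinations

def conflicts(a, b):
    """Two operations conflict if they touch the same physical item and one is a write."""
    return a[2] == b[2] and "w" in (a[0], b[0])

def conflict_pairs(log):
    """Ordered pairs (earlier, later) of conflicting operations in a sequential log."""
    # combinations() yields pairs in list order, so `a` always precedes `b` in the log.
    return {(a, b) for a, b in combinations(log, 2) if conflicts(a, b)}

def equivalent(log1, log2):
    """Sufficient condition of Proposition 1: same operations, same conflict order."""
    return sorted(log1) == sorted(log2) and conflict_pairs(log1) == conflict_pairs(log2)

# Hypothetical logs: b only reorders non-conflicting operations; c reverses a conflict.
a = [("w", 0, "x1"), ("r", 1, "x1"), ("r", 2, "y1"), ("w", 2, "y1")]
b = [("w", 0, "x1"), ("r", 2, "y1"), ("r", 1, "x1"), ("w", 2, "y1")]
c = [("r", 1, "x1"), ("w", 0, "x1"), ("r", 2, "y1"), ("w", 2, "y1")]
print(equivalent(a, b), equivalent(a, c))
```

Since the proposition is stated only as a sufficient condition, a False result from this sketch does not by itself show that two logs are inequivalent.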


3.  CORRECTNESS  CONCEPTS 

The correctness of any system must be defined relative to users' expectations.
Intuitively,  a  system  is  correct  if  it  does  what  users  want  it  to  do.  We  assume 
that  users  expect  a  DDBS  to  behave  like  a  serial  transaction  processor;  that 
is,  users  expect  the  DDBS  to  behave  as  if  it  were  processing  transactions  one 
at  a  time,  against  the  logical,  non-replicated  database.  (This  assumption  is 
adopted  almost  uniformly  in  the  literature.)  A  DDBS  is  correct  if  it  behaves 
like  a  serial  transaction  processor  in  this  sense. 

In  this  section  we  analyze  DDBS  correctness  using  the  basic  concepts  of 
Section  2. 

A serial log is a total order of operations such that for every pair of
transactions, all operations of one transaction precede each operation of the
other. For example,

$$
L_2 = w_0[x_1,x_2,y_1,z_1] \rightarrow r_2[x_1] \rightarrow w_2[x_1,x_2] \rightarrow r_3[x_1] \rightarrow w_3[z_1]
\rightarrow r_1[y_1] \rightarrow r_1[x_2] \rightarrow w_1[x_1,x_2,y_1]
$$

is a serial log.

Consider any read operation in a serial log, e.g. $r_2[x_1]$ above. This
operation is said to read-from the nearest write operation before it
that writes into its argument. E.g. in $L_2$, $r_2[x_1]$ reads-from $w_0[x_1,x_2,y_1,z_1]$,
while $r_3[x_1]$ and $r_1[x_2]$ read-from $w_2[x_1,x_2]$. Similarly, transaction $T_i$
reads-$x_k$-from $T_j$ if $T_i$ indeed reads $x_k$, $T_j$ writes $x_k$, and $r_i[x_k]$
reads-from $w_j[x_k]$. E.g. in $L_2$, $T_2$ reads-$x_1$-from $T_0$.

A serial log is one-copy equivalent (or simply 1-serial) if for each
transaction $T_i$, and for each $x_k$ that $T_i$ reads, $T_i$ reads-$x_k$-from the last
transaction before $T_i$ that writes into any copy of X.

The reader can verify that $L_2$ is 1-serial. However, if we change $w_2[x_1,x_2]$
to $w_2[x_2]$, the resulting log is not 1-serial.

$$
L_3 = w_0[x_1,x_2,y_1,z_1] \rightarrow r_2[x_1] \rightarrow w_2[x_2] \rightarrow r_3[x_1] \rightarrow w_3[z_1]
\rightarrow r_1[y_1] \rightarrow r_1[x_2] \rightarrow w_1[x_1,x_2,y_1]
$$

$L_3$ is not 1-serial, because $T_3$ reads-$x_1$-from $T_0$, which is not the last
transaction before $T_3$ that wrote into any copy of X.
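
The 1-serial test can be checked by machine on the two example logs. The sketch below is illustrative only and rests on assumptions of ours: a serial log is given as a list of (transaction, copies read, copies written) in serial order, and a copy name such as "x2" denotes a copy of the logical item "x". The names `logical` and `is_one_serial` are hypothetical.

```python
def logical(copy):
    """Map a physical copy name such as 'x2' to its logical item 'x' (naming assumption)."""
    return copy.rstrip("0123456789")

def is_one_serial(serial_log):
    """Check the 1-serial condition on a serial log given in transaction order."""
    last_copy_writer = {}  # physical copy -> last transaction that wrote that copy
    last_item_writer = {}  # logical item  -> last transaction that wrote ANY copy of it
    for txn, reads, writes in serial_log:
        # Reads come first, matching the paper's reads-before-writes convention.
        for copy in reads:
            # The copy read must have been written by the last writer of the logical item.
            if last_copy_writer.get(copy) != last_item_writer.get(logical(copy)):
                return False
        for copy in writes:
            last_copy_writer[copy] = txn
            last_item_writer[logical(copy)] = txn
    return True

# The serial logs L2 and L3 of the text, in the order T0, T2, T3, T1.
L2 = [(0, [], ["x1", "x2", "y1", "z1"]), (2, ["x1"], ["x1", "x2"]),
      (3, ["x1"], ["z1"]), (1, ["y1", "x2"], ["x1", "x2", "y1"])]
L3 = [(0, [], ["x1", "x2", "y1", "z1"]), (2, ["x1"], ["x2"]),
      (3, ["x1"], ["z1"]), (1, ["y1", "x2"], ["x1", "x2", "y1"])]
print(is_one_serial(L2), is_one_serial(L3))
```

In $L_3$ the check fails at $T_3$: the copy $x_1$ was last written by $T_0$, while the last writer of any copy of X is $T_2$.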

A  1-serial  log  represents  a  serial  execution  of  transactions  in  which  the 
replicated  copies  of  each  data  item  behave  like  a  single  logical  data  item. 
Therefore, every 1-serial log is correct in the sense defined at the beginning
of this section.

A log is serializable (SR) if it is equivalent to a serial log. A log is
1-serializable (1-SR) if it is equivalent to a 1-serial log. Since every
1-serial log is correct, and since every 1-SR log is equivalent to a 1-serial
log, every 1-SR log is also correct.

We adopt 1-SR as our basic notion of correctness for the rest of this paper.

If sites are never added to a DDBS and sites never fail, attaining
1-SR is little more than a concurrency control problem. All we have to do
is:

(1)  make  sure  that  each  transaction  writes  into  all  physical  copies 
of  its  writeset,  as  described  in  Section  2;  and 

(2)  synchronize  read  and  write  operations  using  any  serializable 
concurrency  control  algorithm,  such  as  two-phase  locking. 

The  following  proposition  states  the  correctness  of  these  steps  in  terms 
of  logs. 

Proposition  2  A  log  is  1-SR  if  every  transaction  in  the  log  writes  into 
all  copies  of  its  writeset,  and  the  log  is  SR. 


4.  SITE  INITIALIZATION  ALGORITHM 

Suppose  we  have  a  DDBS  that  is  running  correctly  —  i.e.  its  execution 
is  1-SR  —  and  suppose  we  add  a  new  site  to  the  system.  We  need  to  integrate 
the  site  into  the  DDBS  in  such  a  way  that  (1)  all  data  at  the  site  can  eventually 
be  read,  and  (2)  the  resulting  execution  remains  1-SR. 

In  this  section  we  describe  an  algorithm  that  accomplishes  this  task.  First, 
we  use  the  concepts  of  Sections  2  and  3  to  specify  the  kinds  of  executions 
permitted  by  our  algorithm,  and  to  argue  that  these  executions  are  correct 
(i.e.  satisfy  requirements  (1)  and  (2)).  Then,  we  demonstrate  an  algorithm 
that  meets  the  specification. 

Specification  and  Correctness 

The  logs  that  our  algorithm  will  allow  satisfy  the  following  properties. 

A1. Each transaction writes into all copies of its writeset, except
    possibly those copies at the new site.

A2. By some time t, every data item at the new site has been written
    into at least once.

A3. No transaction reads a data item at the new site until that data
    item has been written at least once.

A4. The log is SR.

A5. Let $x_{new}$ be a copy of X at the new site, and let $T_x$ be the
    first transaction that writes into $x_{new}$. By A1, $T_x$ also writes
    into the other copies of X. Let $T'_x$ be any transaction that
    writes into any copy of X after $T_x$ wrote into that copy. Then
    $T'_x$ must also write into $x_{new}$.

Stated a bit loosely, A5 simply means that once any transaction writes
into $x_{new}$, all later transactions also write into $x_{new}$.

We now argue that if a log satisfies A1-A5 then it is correct.

(1)  A2  and  A3  ensure  that  all  data  items  at  the  new  site  are  eventually 
readable,  thereby  attaining  the  first  correctness  requirement. 

(2) It remains to prove that if log L satisfies A1-A5, then L is 1-SR.

By A4, L is SR; let $L_s$ be any serial log equivalent to L. Consider any
reads-from relationship in $L_s$, e.g. $T_i$ reads-$x_k$-from $T_j$. $L_s$ looks like:

$$L_s = \ldots \rightarrow w_j[x_k] \rightarrow \ldots \rightarrow r_i[x_k] \rightarrow \ldots$$

and we must show that no transaction between $T_j$ and $T_i$ in $L_s$ writes
into any copy of X. We will show this by proving that every transaction that
follows $T_j$ and updates any copy of X also writes into $x_k$.

Let $T_l$ be any transaction that follows $T_j$ and updates X. If $x_k$ is
not a "new" data item, then $T_l$ writes into $x_k$ by A1. Now suppose $x_k$ is
"new". By A1 and Proposition 1, $T_l$ follows $T_j$ in L, and so $T_l$ writes
into $x_k$ by A5. In either case, since $T_l$ writes into $x_k$, and since $T_i$
reads-$x_k$-from $T_j$ (and not from $T_l$), $T_l$ cannot come between $T_j$ and $T_i$.


Algorithm 

Rules  A1-A5  form  a  blueprint  for  a  simple  site  initialization  algorithm. 
Let  us  see  how  these  rules  can  be  attained  algorithmically. 

A1  and  A3  are  trivial  to  implement.  A4  is  merely  concurrency  control. 
Any  serializable  concurrency  control  algorithm  can  be  used.  The  remaining 
rules  can  be  implemented  as  follows: 

A2. For each logical data item X stored at the new site, run a copier
    transaction that reads a copy of X at an old site and writes that
    value into the new copy. (I.e. there is one copier transaction
    per X.) Copiers must be synchronized by the concurrency control
    algorithm exactly like all other transactions.

A5. For each logical data item X stored at the new site, designate a
    guardian copy $x_g$ of X at some old site. Beginning at some
    (arbitrary) time t after the new site is added, the site holding
    $x_g$ alerts all transactions that update $x_g$ to write into the new copy
    of X also. No transaction updates a data item at the new site
    unless told to do so by its guardian.

These  five  rules  constitute  our  proposed  site  initialization  algorithm. 

This  description  of  our  algorithm  may  seem  too  abstract,  mainly  because 
we  have  not  pinned  down  the  concurrency  control  algorithm.  For  definiteness, 
let  us  see  how  the  algorithm  works  in  conjunction  with  two-phase  locking. 

We begin by reviewing the basic two-phase locking algorithm.

Associated with each physical data item is a set of locks. At any time,
the set of locks on a physical data item may contain no locks, one write lock,
or a set of read locks.

Suppose $x_i$ is stored at $DM_i$. Before processing read($x_i$) on behalf of
transaction $T_j$, $DM_i$ must set a read lock on $x_i$ for $T_j$. Before processing
write($x_i$) on behalf of $T_j$, $DM_i$ must set a write lock on $x_i$ for $T_j$.
If $DM_i$ cannot set a lock for an operation, it delays the operation until
the lock can be set.* When a transaction terminates, all of its locks
are released.
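
The locking rules just reviewed can be illustrated by a small lock table. This is a minimal sketch under assumptions of ours, not a description of any particular system: the class and method names are hypothetical, a False return stands for "the DM delays the operation", and a transaction that already holds the only read lock on an item is allowed to upgrade it to a write lock (the text does not discuss upgrades).

```python
class LockTable:
    """Per-DM lock table: each item has no locks, one write lock, or a set of read locks."""

    def __init__(self):
        self.readers = {}  # item -> set of transactions holding read locks
        self.writer = {}   # item -> transaction holding the write lock

    def try_read_lock(self, txn, item):
        holder = self.writer.get(item)
        if holder is not None:
            # A write lock held by txn itself already covers the read.
            return holder == txn
        self.readers.setdefault(item, set()).add(txn)
        return True

    def try_write_lock(self, txn, item):
        holder = self.writer.get(item)
        if holder is not None:
            return holder == txn
        if self.readers.get(item, set()) - {txn}:
            return False  # another transaction holds a read lock: delay the write
        self.readers.pop(item, None)  # assumed upgrade: drop txn's own read lock
        self.writer[item] = txn
        return True

    def release_all(self, txn):
        """Called when txn terminates: all of its locks are released together."""
        for item in [i for i, t in self.writer.items() if t == txn]:
            del self.writer[item]
        for holders in self.readers.values():
            holders.discard(txn)

locks = LockTable()
locks.try_read_lock("T1", "x1")
locks.try_read_lock("T2", "x1")
blocked = not locks.try_write_lock("T1", "x1")  # T2 still holds a read lock
```

Releasing every lock only at termination is what makes the protocol two-phase in this sketch: no lock is acquired after any lock has been released.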


Now let us see how to add a new site to the system. Suppose sites 1, 2, ..., n-1
are running properly and we wish to add site n. Site n begins the process by
sending an "I'm up" message to the DM's at sites 1, 2, ..., n-1.

Suppose the DM at site i, $DM_i$, receives an "I'm up" message from site n.
From then on, for each guardian copy $x_g$ at $DM_i$, when $DM_i$ processes a write($x_g$),
it tells the TM that issued the write to also issue a write($x_n$), where $x_n$ is
the copy of X at the new site. The DM also instructs its local TM to execute
copier transactions for each of its guardian copies. The copier for $x_g$ must
obtain a read lock on $x_g$ and a write lock on $x_n$; i.e. it must be synchronized
like any other transaction.

$DM_n$ uses the same two-phase locking algorithm as every other DM. However,
it refuses to process a read($x_n$) until $x_n$ has been updated at least once.

For each logical data item X, a TM issues a write($x_n$) on behalf of a
transaction that updates X if and only if one of its writes on $x_g$ has been
acknowledged by a message telling it to do so. The TM must not update $x_n$ until
this point in time.
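
The message flow described above can be sketched as follows. This is a deliberately simplified, single-threaded illustration under assumptions of ours: locking is omitted (each method call is treated as atomic), the class and method names are hypothetical, and refusing a read is modelled by raising an exception where a real DM would delay the operation.

```python
class NewSiteDM:
    """DM_n: accepts writes, but refuses read(x_n) until x_n has been written once."""

    def __init__(self, items):
        self.items = set(items)  # logical items that have a copy at the new site
        self.values = {}         # copies written so far

    def write(self, item, value):
        self.values[item] = value

    def readable(self, item):
        return item in self.values

    def read(self, item):
        if not self.readable(item):
            raise RuntimeError(f"{item} not yet initialized at the new site")
        return self.values[item]


class GuardianDM:
    """An old-site DM holding the guardian copies of the items it stores."""

    def __init__(self, values):
        self.values = dict(values)
        self.new_site = None     # set once the "I'm up" message arrives

    def on_im_up(self, new_site):
        self.new_site = new_site

    def write(self, item, value):
        """Process write(x_g); return True if the TM must also issue write(x_n)."""
        self.values[item] = value
        return self.new_site is not None and item in self.new_site.items

    def run_copier(self, item):
        """Copier transaction: read the guardian copy, write the new copy."""
        self.new_site.write(item, self.values[item])


new_dm = NewSiteDM(["x", "y"])
guardian = GuardianDM({"x": 1, "y": 2})
before = guardian.write("x", 5)   # before "I'm up": the new copy is not written
guardian.on_im_up(new_dm)
after = guardian.write("x", 7)    # after "I'm up": the TM is told to write x_n too
if after:
    new_dm.write("x", 7)
guardian.run_copier("y")
```

In the algorithm proper the copier and the forwarded writes are ordered by two-phase locking; the sketch sidesteps this by running everything sequentially.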


*Since  operations  can  be  delayed  while  waiting  for  locks,  deadlock  is  possible. 
Deadlocks  can  be  resolved  by  any  of  the  standard  techniques  in  [BG] . 

5. SITE RECOVERY

Site recovery is the problem of integrating a site into a DDBS when the
site recovers from failure. As for the site initialization problem, the
goal is to make the recovered site look like all other sites. Once again,
the main problem is to bring the database at the recovered site up-to-date
relative to the rest of the DDBS.

Site recovery is obviously an important problem, but it has received
little attention in the literature. One early paper on DDBS reliability [AD],
which mainly studies reliable concurrency control algorithms, disposes of
site recovery with these few words:

    How the new host is brought up to date depends on the
    application. It may be done by transferring to that
    host the journal of all updates since the host went down.
    It may require transferring the database. [AD, p. 568].

Other related work includes [HS, LS, LSP, MPM, Th, Sk, SS]. Some of these
papers [MPM, Th] are like [AD] in that they mainly study reliable concurrency
control. (A piece of the algorithm in [MPM] is called Single Node Recovery.
But the algorithm only recovers the concurrency control algorithm at the site,
not the database.) Other papers study atomic commitment [HS, LS, Sk, SS], site
monitoring to keep track of which sites are up [HS], and other distributed
decision problems [LSP]. Again, site recovery in our sense is not studied.

One paper that does treat site recovery is [HS]. The solution is based on
the concept of a Reliable Network (Relnet), a virtual machine that guarantees
reliable message delivery despite site failures. The Relnet is intended to be
a  very  general  facility  suitable  for  many  kinds  of  distributed  systems.  Because 
of  this  generality,  the  mechanism  is  rather  complex. 

Our approach to site recovery is narrower (and, we hope, simpler) than the
Relnet approach. We are not trying to attain reliability for arbitrary
distributed systems; nor are we trying to solve all DDBS reliability problems. Our
goal is simply to integrate the database at a recovered site into a running DDBS.


Evidently, site recovery and site initialization are almost identical
problems, and the algorithm of Section 4 can be directly applied here.
There is one major caveat: our algorithm says nothing about multiple failures.
We believe the algorithm can be generalized to handle multiple failures, but
offer no hard evidence in this respect. Despite this caveat, the algorithm
of Section 4 solves a big piece of the site recovery problem.

An  Optimization 

When using the initialization algorithm for site recovery, an important
optimization is possible. It is not necessary to fire up copier transactions
for all X in the logical database. Suppose we are recovering site f.
Only those X that were updated after site f failed need to be recovered.
Any X that was not updated while f was down still has the correct value at
f when the site recovers. If a spool (or journal) of all writes to site f is
maintained while f is down, as in SDD-1 [HS], then when f recovers the
following processing can be done. Scan the spool to produce a list of data items
that were written while f was down. All data items not on the list can be
immediately marked as readable at $DM_f$. Copiers are executed only for data items
on the list.
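
The spool scan can be sketched in a few lines. The fragment below is an illustration under assumptions of ours: the spool is modelled as a sequence of (item, value) pairs addressed to the failed site, and `plan_recovery` is a hypothetical name.

```python
def plan_recovery(stored_items, spool):
    """Split the items stored at the recovering site f into two sets.

    stored_items: items that have a copy at site f.
    spool: iterable of (item, value) writes addressed to f while it was down.
    Returns (readable_now, needs_copier).
    """
    stored = set(stored_items)
    # Only the item names matter here; the spooled values are never replayed.
    needs_copier = {item for item, _value in spool} & stored
    readable_now = stored - needs_copier
    return readable_now, needs_copier

readable_now, needs_copier = plan_recovery(
    ["x", "y", "z"],
    [("x", 1), ("x", 2), ("z", 9)],  # x was written twice while f was down
)
print(sorted(readable_now), sorted(needs_copier))
```

The sketch uses the spool only to decide which items are stale; the current values are then obtained by copier transactions, as in Section 4.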

Notice  that  we  are  not  proposing  that  spooled  write  operations  be  processed 
in  FIFO  order,  as  in  SDD-1.  If  X  was  written  several  times  while  f  was 
down,  only  the  last  value  should  be  sent  to  f.  If  earlier  values  are  sent, 
the  algorithm  will  not  work  correctly. 


6.  SITE  BACKUP 

A backup database is a static copy of the database that is consistent but
potentially out-of-date. One use of backup databases is to speed up the
processing of queries. By reading the static backup, a query does not interfere
with updates, and so will not be delayed or restarted for concurrency control
reasons. The cost is that it may read out-of-date data. Backup databases are
also useful for archiving data.

Creating a backup database is similar to initializing a new site or recovering
a failed site, similar enough that we can use our initialization algorithm
to do most of the job.

We  begin  by  pretending  that  the  backup  database  is  a  new  site  being  added  to 
the  DDBS.  We  run  the  initialization  algorithm  to  bring  the  backup  database 
up-to-date,  until  all  data  items  in  the  backup  have  been  written  at  least  once. 

Now we must freeze the backup, by shutting off writes to it. However, we must
shut off writes carefully, so that the backup is frozen in a consistent state.
We can do this simply by running a query that (conceptually) reads the entire
backup database, and by ensuring that no data item is written once the query
has read it. This freezes the backup copy in the state read by the query. Since
the query is synchronized by a serializable concurrency control algorithm, the
frozen state is consistent.

For  example,  suppose  we  use  the  two-phase  locking  initialization  algorithm 
of  Section  4.  When  all  backup  data  items  have  been  written  at  least  once,  we 
run  a  query  that  sets  a  read  lock  on  every  backup  data  item.  (The  query  may 
deadlock  while  trying  to  obtain  its  locks,  and  so  may  need  to  be  aborted  and 
restarted.)  When  all  backup  data  items  are  locked,  we  shut  off  updating  by 
refusing  to  process  any  more  writes  against  the  backup.  The  resulting  backup 
database  is  consistent  and  can  be  correctly  queried  without  synchronization. 

One problem with this algorithm is that the "shut-off" query may deadlock
repeatedly and never finish. This problem can be fixed as follows. Once the
query begins, the backup should refuse to grant any write lock requests from
transactions that have not already set a lock on some backup copy. These requests
are simply blocked, and the transactions delayed, until the query manages to
get all of its locks. Then a very counterintuitive event happens: the lock
requests are unblocked, but since the backup is now frozen, the transactions
no longer need the locks! (It does not work to unblock the transactions
earlier.)

7.  CONCLUSION 

We have presented an algorithm that can be used to initialize a database
at a new site in a DDBS, to recover a database at a formerly failed site, or
to create a consistent, static backup database. The algorithm is simple and
introduces little overhead beyond what is normally needed for concurrency
control. We therefore believe it is a practical solution to all three problems.

The methodology that we used to describe our algorithm is also interesting,
we believe. First, we defined correctness, i.e. what it means for an algorithm
to correctly solve the problem; this definition was stated in terms of
executions (i.e. logs). Second, we specified the kinds of logs that our algorithm
would allow, and proved that every allowable log is correct. Third, we
described an abstract algorithm that meets the specification. Finally, we gave
a concrete implementation of the abstract algorithm. These four steps,

(i) defining correct logs,

(ii) specifying an allowable subset of the correct logs,

(iii) designing an abstract algorithm that produces allowable logs,

(iv) engineering a concrete implementation of the abstract algorithm,

help structure the problem and the search for a solution.

One benefit is that we can engineer new concrete algorithms for specific
systems or problems just by redoing step (iv). For example, the concrete
implementation of the backup algorithm in Section 6 may have bad performance
characteristics: by locking the entire backup database, the "shut-off" query
interferes with many updates. This performance problem is not inherent in the
abstract algorithm; it is just an artifact of the concrete implementation we
gave. A better implementation would use a concurrency control algorithm for
the backup in which queries and updates interfere less. Multiversion
concurrency control algorithms [BHR, Re, SR] are likely candidates for
this role. Engineering a backup algorithm that uses multiversion
concurrency control is by no means a trivial task. But by structuring the
problem as we have done, the designer does not have to start from scratch.


0. INTRODUCTION

Most automatic crash recovery mechanisms for database systems are based
on the concept of transaction commitment. Speaking very informally, when the
system designates a transaction to be committed, it "promises" to install in
the database all the updates effected by that transaction. Put another way,
should a crash occur after a transaction has become committed, the transaction
may not be restarted. If, however, a crash occurs before a transaction has
become committed, that transaction must be restarted. Several mechanisms that
achieve this behavior have been proposed by database system designers (e.g.
[R 75], [Lo 77], [G 78], [Li 79]). In all these systems, when a transaction
is restarted, it is "rolled back" all the way to the beginning. One exception
to this rule is System R, which allows an uncommitted transaction to be rolled
back to an earlier "savepoint", which is not necessarily its beginning [A 76].

In this paper we investigate the problem of finding the optimal set of
savepoints  (one  per  transaction)  to  which  uncommitted  transactions  executing 
concurrently  must  be  rolled  back  after  a  crash,  so  that  the  recovery  cost  is 
minimized. 

This paper is organized as follows. Section 1 informally motivates the
problem. In Section 2 we present a more formal model of transaction execution
in terms of which our results are stated and proved. In Section 3 we give an
algorithm for the problem under consideration and prove its correctness and
optimality. In many environments, cascading restarts are considered intolerable,
and sufficiently restrictive schedulers are used to prevent them. In such
environments the problem dealt with in this paper has a trivial solution.
This issue is discussed in Section 4.


1. THE PROBLEM


We consider a database system in which transactions operate on the
database concurrently. The system periodically takes "transaction savepoints" for
the transactions currently active. A transaction savepoint involves saving
the current state of a transaction in non-volatile storage. What exactly
constitutes the "current state" of a transaction depends on details that we
do not care to consider here. Typically it would contain such information as
the program counter, the values of all local variables created by the
transaction so far, any locks held by the transaction, etc. We make no assumptions
concerning the timing of the savepoints. They may be asynchronous (i.e.
happening at different times for different transactions), or occurring with
different frequencies for different transactions. The system may be taking
savepoints at its convenience. Alternatively, the transactions themselves
could specify when savepoints are to be taken. All the savepoints of a
transaction must be kept until that transaction becomes committed. That is, a
savepoint of a transaction must not "overwrite" a previous savepoint of that
(or any other) transaction.

If a crash occurs,¹ the recovery algorithm selects an appropriate set of
savepoints, one per (uncommitted) transaction. Each (uncommitted) transaction
is then rolled back to its corresponding savepoint, and processing continues
from there, as if the crash had never occurred.²


¹We are only concerned with "soft" crashes in this paper, i.e. crashes in which
volatile storage is lost, but non-volatile storage is not affected.

²We must, of course, "synchronize" the actual database with the state of
affairs reflected by rolling back the uncommitted transactions. For example,
all updates of any such transaction that occurred after the savepoint to which
the transaction is rolled back must be "undone" from the database. This can
be achieved if the database system maintains an audit trail. This technique
is well-known (see, for example, [G 78], [Li 79]) and will not be addressed
here. This question is dealt with in an environment akin to the one considered
here in [F 81]. Also, the conditions under which rolling back transactions does
not endanger the correctness of the database state are discussed in [H 81].


At this point it might appear sufficient to simply roll back each
transaction to its last savepoint. Unfortunately, this naive approach could well
lead to inconsistent database states. The fundamental reason is that operations
on a database issued by different transactions are not independent of each
other. Informally, we say that an operation a depends on operation b, if a
was executed after b and the result of a would have been different if b
had not been executed. For example, a Read(x) depends on the immediately
previous Write(x), and a Lock(x) depends on the immediately preceding
Unlock(x) (assuming exclusive locking). The exact nature of the dependence
of operations on one another is not important here. We simply assume that
some dependence relation is specified which correctly reflects the semantics
of the various operations.

Consider now the following sequence of events:

(1) savepoint of transaction $T_1$ is taken

(2) operation $a_1$ of $T_1$ is executed

(3) operation $a_2$ of transaction $T_2$ is executed

(4) savepoint of $T_2$ is taken

(5) the system crashes.

Assume, further, that operation $a_2$ depends on $a_1$. If we were to roll back
each of $T_1$, $T_2$ to their last savepoints, we would be restarting the system
at a state in which operation $a_2$ is executed (since it happened before $T_2$'s
last savepoint), while $a_1$ (on whose execution $a_2$ depends) is not
executed (since it happened after $T_1$'s last savepoint). This is an
inconsistent execution state, at least according to our informal notion of operation
dependence.
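
The five-event example can be replayed mechanically. The following sketch is illustrative only; the data layout and the name `is_consistent` are assumptions of ours. A savepoint is identified by the number of steps its transaction had executed when it was taken, and a choice of savepoints is accepted only if no surviving operation depends on an operation that is rolled back.

```python
def is_consistent(chosen, position, depends):
    """Check a choice of savepoints against the informal dependence rule.

    chosen:   transaction -> number of steps executed before its chosen savepoint
    position: operation   -> (transaction, 1-based step number)
    depends:  set of pairs (b, a) meaning "operation b depends on operation a"
    """
    def survives(op):
        txn, step = position[op]
        return step <= chosen[txn]  # steps after the savepoint are rolled back

    # Every dependence must have a surviving source or a rolled-back dependent.
    return all(survives(a) or not survives(b) for b, a in depends)

# Events (1)-(5): T1 takes a savepoint, a1 runs, a2 runs, T2 takes a savepoint, crash.
position = {"a1": ("T1", 1), "a2": ("T2", 1)}
depends = {("a2", "a1")}

last_savepoints = {"T1": 0, "T2": 1}  # keeps a2 but rolls back a1
earlier_for_t2 = {"T1": 0, "T2": 0}   # assumes T2 can also restart from its beginning

print(is_consistent(last_savepoints, position, depends))
print(is_consistent(earlier_for_t2, position, depends))
```

The second choice is consistent but discards more of $T_2$'s work, which is the trade-off the optimality condition below is meant to capture.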

This example illustrates that it is not always correct to roll back each
(uncommitted) transaction to its last savepoint. Our task now is to provide
an algorithm for choosing a set of savepoints S*, one per uncommitted
transaction, such that

(i) the execution state reflected by rolling back each transaction to
    the corresponding savepoint is consistent, and

(ii) S* is, in a sense, optimal.

These two conditions will be defined precisely in the next section.


2.  FORMAL  MODEL 

Each transaction $T_i$ consists of a sequence of atomic steps or operations
$a_{i1}, a_{i2}, \ldots, a_{i\ell_i}$. For each $T_i$ we have a set of savepoints
$SP_i = \{s_{i1}, s_{i2}, \ldots, s_{im_i}\}$ and a mapping
$\phi_i : SP_i \to \{1, 2, \ldots, \ell_i\}$, such that
$\phi_i(s_{ij}) < \phi_i(s_{ik})$ if $1 \le j < k \le m_i$. Intuitively, savepoint
$s_{ij}$ was taken after the execution of step $a_{ik}$ and before the execution
of $a_{i,k+1}$, where $k = \phi_i(s_{ij})$. A concurrent execution of
$T_1, \ldots, T_n$ is a sequence consisting of the elements of
$A = \{a_{ij} : 1 \le i \le n,\ 1 \le j \le \ell_i\}$, respecting the order of
atomic steps belonging to the same transaction. With respect to a concurrent
execution $E$ of $T_1, \ldots, T_n$ we are given a partial order
$\to_E \subseteq A \times A$, where we require that if $a_{ij} \to_E a_{i'j'}$,
then $a_{ij}$ appears before $a_{i'j'}$ in $E$. $\to_E$ is called the dependence
relation, and formalizes the intuitive notion of operation dependence
discussed in the previous section: if $a \to_E b$ then, in execution $E$, step
$b$ "depends" on step $a$. Since we shall be dealing with a fixed execution $E$,
the subscript will be omitted from the symbol "$\to_E$". Let
$S = \{s_{1t_1}, \ldots, s_{nt_n}\}$ where, as usual, $s_{it_i}$ is a savepoint of
transaction $T_i$.

We shall say that S is a consistent set of savepoints (with respect to an
execution E), iff S satisfies the following condition:

(*) if $a_{ij} \to a_{i'j'}$ then either $j \le \phi_i(s_{it_i})$ or $j' > \phi_{i'}(s_{i't_{i'}})$.

Intuitively this means that if an operation $a_{i'j'}$ depends on $a_{ij}$ and we
roll back each transaction $T_i$ to the corresponding savepoint in S, we are
restarting the system in a state where either $a_{ij}$ has been executed
($j \le \phi_i(s_{it_i})$) or $a_{i'j'}$ has not been executed
($j' > \phi_{i'}(s_{i't_{i'}})$). As we argued informally in the previous section,
unless this condition is satisfied, the state we are restarting our system
from is inconsistent.
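
Condition (*) can be checked mechanically. The following is a minimal Python sketch, not part of the original paper. It assumes a hypothetical encoding in which `phi[i][j]` holds $\phi_i(s_{ij})$ with `phi[i][0] == 0` standing for the start of the transaction, `deps` lists pairs `((i, j), (i2, j2))` meaning that step `j2` of transaction `i2` depends on step `j` of transaction `i`, and transactions are numbered from 0. The sample data mimics the crash scenario of the previous section.

```python
def is_consistent(phi, deps, t):
    """Return True iff the savepoint choice t satisfies condition (*).

    phi[i][j] : index of the last step executed before savepoint j of
                transaction i (phi[i][0] == 0 is the transaction start).
    deps      : pairs ((i, j), (i2, j2)): step j2 of i2 depends on step j of i.
    t[i]      : index of the savepoint transaction i is rolled back to.
    """
    for (i, j), (i2, j2) in deps:
        survives = j <= phi[i][t[i]]      # the step depended upon is kept
        undone = j2 > phi[i2][t[i2]]      # the dependent step is rolled back
        if not (survives or undone):
            return False
    return True


# Transaction 0: step 1, savepoint, step 2.  Transaction 1: step 1, savepoint.
phi = [[0, 1], [0, 1]]
# Step 1 of transaction 1 depends on step 2 of transaction 0.
deps = [((0, 2), (1, 1))]

print(is_consistent(phi, deps, [1, 1]))  # False: both roll back to last savepoint
print(is_consistent(phi, deps, [1, 0]))  # True: transaction 1 restarts from scratch
```

Rolling both transactions back to their last savepoints fails the check, whereas restarting the dependent transaction from its beginning passes it.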

Now we need to define what is meant by an "optimal" set of savepoints.
To do so, we define the class of "reasonable" cost functions. Let $s_{it_i}$,
$s_{ir_i}$ be savepoints of $T_i$. We say that $f(x_1, \ldots, x_n)$ is a reasonable
cost function iff for all $j$, $1 \le j \le n$, if $t_i = r_i$ for $i \ne j$ and
$t_j < r_j$, then $f(s_{1t_1}, \ldots, s_{nt_n}) > f(s_{1r_1}, \ldots, s_{nr_n})$. This
captures the idea that the more we roll back, the greater the cost of
recovery.

We say that a set of savepoints S* is optimal iff it minimizes a given
reasonable cost function among all consistent sets of savepoints. Fortunately,
the same S* minimizes all reasonable cost functions (Lemma 3).

The class of reasonable functions includes such obvious candidates for
optimization as $f(s_{1t_1}, \ldots, s_{nt_n}) = \sum_{i=1}^{n} c(s_{it_i})$ and
$f(s_{1t_1}, \ldots, s_{nt_n}) = \max_{1 \le i \le n} c(s_{it_i})$, where $c(s_{it_i})$
is the cost of restarting $T_i$ from savepoint $s_{it_i}$, under the assumption
that it costs more to restart a transaction from an earlier savepoint.
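
As an illustration only, the definition of a reasonable cost function can be tested by brute force on small instances. The sketch below is not from the paper: the per-savepoint cost table `cost` and the helper names are hypothetical, and savepoint choices are encoded as tuples `t` of savepoint indices, with index 0 denoting the start of the transaction.

```python
from itertools import product


def total_cost(cost, t):
    """Sum of the restart costs c(s_{i t_i}) over all transactions."""
    return sum(cost[i][t[i]] for i in range(len(t)))


def is_reasonable(f, m):
    """Brute-force check of the definition: moving exactly one transaction
    to an earlier savepoint must strictly increase f."""
    for r in product(*(range(mi + 1) for mi in m)):
        for j in range(len(m)):
            for tj in range(r[j]):              # every earlier savepoint of T_j
                t = r[:j] + (tj,) + r[j + 1:]
                if not f(t) > f(r):
                    return False
    return True


# Hypothetical restart costs: earlier savepoints are strictly more expensive.
cost = [[5, 2], [9, 4, 1]]
m = (1, 2)  # index of the last savepoint of each transaction

print(is_reasonable(lambda t: total_cost(cost, t), m))  # True
print(is_reasonable(lambda t: 0, m))                    # False: constant cost
```

The sum of strictly decreasing per-savepoint costs passes the check, while a constant function does not, since rolling back further never raises it.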


3.  THE  ALGORITHM  AND  THE  PROOF 

We now describe an algorithm for finding the optimal, consistent set of
savepoints that minimizes any reasonable cost function.


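Algorithm 1

Input: For each transaction $T_i$ the set $SP_i$, the function
$\phi_i : SP_i \to \{1, 2, \ldots, \ell_i\}$, and the dependence relation $\to$ on
the set $A = \{a_{ij} : 1 \le i \le n,\ 1 \le j \le \ell_i\}$.

Output: A consistent set of savepoints S* minimizing any reasonable cost
function.

Method:

Step 0: Construct a digraph $G = (N, E)$ with node set
$N = \{s_{ij} : 1 \le i \le n,\ 1 \le j \le m_i\} \cup \{s_{i0} : 1 \le i \le n\}$,
where $\phi_i(s_{i0}) = 0$ by fiat, and edge set
$E = \{(s_{ij}, s_{i'j'})$ : there are steps $a_{ip}$, $a_{i'p'}$ such that
$a_{ip} \to a_{i'p'}$, $\phi_i(s_{ij}) < p \le \phi_i(s_{i(j+1)})$ and
$\phi_{i'}(s_{i'(j'-1)}) < p' \le \phi_{i'}(s_{i'j'})\}$.

Step 1: Initialize $t_i := m_i$, $1 \le i \le n$.

Step 2:
while $\exists (s_{ij}, s_{i'j'}) \in E$ s.t. $j \ge t_i$ and $j' \le t_{i'}$
do $t_{i'} := t_{i'} - 1$ od

Step 3: $S^* := \{s_{1t_1}, \ldots, s_{nt_n}\}$.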
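
The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: transactions are numbered from 0, `phi[i]` is the increasing list of $\phi_i$ values with `phi[i][0] == 0` for $s_{i0}$, and `deps` holds pairs `((i, p), (i2, p2))` meaning that step `p2` of transaction `i2` depends on step `p` of transaction `i`. A step executed after the last savepoint of its transaction is treated as lying above that last savepoint, a boundary case the text leaves implicit. The function names are hypothetical.

```python
from bisect import bisect_left


def build_edges(phi, deps):
    """Step 0: map each dependence between steps to an edge between savepoints."""
    edges = set()
    for (i, p), (i2, p2) in deps:
        j = bisect_left(phi[i], p) - 1      # last savepoint taken before step p
        j2 = bisect_left(phi[i2], p2)       # first savepoint taken after step p2
        if j2 < len(phi[i2]):               # else step p2 is undone by any rollback
            edges.add(((i, j), (i2, j2)))
    return edges


def choose_savepoints(phi, deps):
    """Steps 1-3: return the list t of chosen savepoint indices."""
    edges = build_edges(phi, deps)
    t = [len(s) - 1 for s in phi]           # Step 1: t_i := m_i
    changed = True
    while changed:                          # Step 2: repeat until no edge qualifies
        changed = False
        for (i, j), (i2, j2) in edges:
            if j >= t[i] and j2 <= t[i2]:   # the while condition
                t[i2] -= 1
                changed = True
    return t                                # Step 3: S* = {s_{i t_i}}


# Two transactions, as in the crash scenario of Section 1.
print(choose_savepoints([[0, 1], [0, 1]], [((0, 2), (1, 1))]))  # [1, 0]

# Three transactions where one rollback cascades into another.
phi = [[0, 2], [0, 1, 3], [0, 2]]
deps = [((0, 3), (1, 3)), ((1, 2), (2, 1))]
print(choose_savepoints(phi, deps))  # [1, 1, 0]
```

In the second example, transaction 1 loses its last savepoint because of a step that transaction 0 executed after its own last savepoint, and that in turn forces transaction 2 back to its start.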


We now prove the correctness and optimality of Algorithm 1 through a sequence
of Lemmata.

LEMMA 1: Algorithm 1 always terminates.

Proof: The only reason it might not is the while loop of Step 2. Note,
however, that each time the loop is executed, exactly one $t_{i'}$ is decremented
by 1. Moreover, if $t_{i'} = 0$, all edges $(s_{ij}, s_{i'j'}) \in E$ violate the while
condition because $j' \ge 1$ (as there are no edges involving $s_{i'0}$) and hence
$j' > t_{i'}$. Therefore, no $t_{i'}$ needs to be decremented below 0 and after at
most $\sum_{i=1}^{n} m_i$ iterations the while loop will terminate. □

Footnote: Note that there is a convenient abuse of notation in the
construction of G, in that we confuse the nodes of G with the savepoints.


LEMMA 2: A set of savepoints $S = \{s_{1t_1}, \ldots, s_{nt_n}\}$ is consistent iff
no edge of G satisfies the while condition in Step 2 of Algorithm 1.

Proof: [only if] Suppose S is consistent, yet there is an edge
$(s_{ij}, s_{i'j'}) \in E$ such that $j \ge t_i$ and $j' \le t_{i'}$. By construction
of G, we have that there exist $a_{ip}$, $a_{i'p'}$ such that $a_{ip} \to a_{i'p'}$,
$\phi_i(s_{ij}) < p \le \phi_i(s_{i,j+1})$ and
$\phi_{i'}(s_{i',j'-1}) < p' \le \phi_{i'}(s_{i'j'})$. But then
$\phi_i(s_{it_i}) \le \phi_i(s_{ij}) < p$ and
$p' \le \phi_{i'}(s_{i'j'}) \le \phi_{i'}(s_{i't_{i'}})$, violating the consistency
condition.

[if] Suppose S is not consistent. That is, there exist $a_{ip}$ and $a_{i'p'}$
such that $a_{ip} \to a_{i'p'}$, and (1) $p > \phi_i(s_{it_i})$ and
(2) $p' \le \phi_{i'}(s_{i't_{i'}})$. Let $j$, $j'$ be such that
(3) $\phi_i(s_{ij}) < p \le \phi_i(s_{i,j+1})$ and
(4) $\phi_{i'}(s_{i',j'-1}) < p' \le \phi_{i'}(s_{i'j'})$. By the construction of G
it follows that $(s_{ij}, s_{i'j'}) \in E$. From (1), (3) it follows that
$j \ge t_i$ and from (2), (4) that $j' \le t_{i'}$. Thus $(s_{ij}, s_{i'j'})$
satisfies the condition of the while loop. □

Lemma 2 implies the following

COROLLARY 1: When Algorithm 1 terminates, S* is a consistent set.

LEMMA 3: There is a unique consistent set that minimizes all reasonable
functions.

Proof: Let $S = \{s_{1t_1}, \ldots, s_{nt_n}\}$ and $S' = \{s_{1r_1}, \ldots, s_{nr_n}\}$
be two distinct consistent sets both minimizing some reasonable function $f$.
That is, $f(s_{1t_1}, \ldots, s_{nt_n}) = f(s_{1r_1}, \ldots, s_{nr_n})$. We claim that
$\bar{S} = \{s_{1q_1}, \ldots, s_{nq_n}\}$, where $q_i = \max(t_i, r_i)$, is also a
consistent set. For if not, by Lemma 2, there would be an edge
$(s_{ij}, s_{i'j'}) \in E$ such that $j \ge q_i$ and $j' \le q_{i'}$. There are four
cases to be considered: (1) if $q_i = t_i$ and $q_{i'} = t_{i'}$, then S is not a
consistent set; (2) if $q_i = r_i$ and $q_{i'} = r_{i'}$, then S' is not a
consistent set; (3) if $q_i = t_i$ and $q_{i'} = r_{i'}$, then S' is not a
consistent set (since $j \ge t_i \ge r_i$); and (4) if $q_i = r_i$ and
$q_{i'} = t_{i'}$, then S is not a consistent set (since $j \ge r_i \ge t_i$). All
cases contradict the assumption that S and S' are consistent. Hence $\bar{S}$
is also consistent. By the definition of reasonable functions, it cannot be
that $s_{it_i} \le s_{ir_i}$ for all $i$ or that $s_{it_i} \ge s_{ir_i}$ for all $i$,
for otherwise it could not be that both S and S' minimize a reasonable $f$.
Hence $\bar{S}$ is different from both S and S', and by the definition of
reasonable functions,
$f(s_{1q_1}, \ldots, s_{nq_n}) < f(s_{1t_1}, \ldots, s_{nt_n}) = f(s_{1r_1}, \ldots, s_{nr_n})$,
contradicting the assumption that S and S' both minimize $f$. We have shown
that for any given reasonable cost function $f$ there exists a unique optimal
consistent set of savepoints. It is now easy to show that the same set is
optimal for all reasonable cost functions. Let $f$, $f'$ be two such functions,
and suppose that the respective optimal consistent sets of savepoints are
S, S'. Unless $t_i = r_i$ for $1 \le i \le n$, at least one of $f$ and $f'$ would
yield a smaller cost on $\bar{S}$ (defined as above) than on its optimal set
S or S', respectively, a contradiction. □
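
The key step of the proof, that the componentwise maximum of two consistent sets is again consistent, can be observed on a small instance. The sketch below is illustrative only. It takes a hypothetical edge set of G directly, with edges written as `((i, j), (i2, j2))` over 0-based transaction indices, enumerates every savepoint choice, and uses Lemma 2 (no edge satisfies the while condition) as the consistency test.

```python
from itertools import product


def violates(edges, t):
    """True iff some edge satisfies the while condition of Step 2 for t."""
    return any(j >= t[i] and j2 <= t[i2] for (i, j), (i2, j2) in edges)


def consistent_sets(m, edges):
    """All savepoint choices t (one index per transaction) that are consistent."""
    choices = product(*(range(mi + 1) for mi in m))
    return [t for t in choices if not violates(edges, t)]


# Hypothetical graph: three transactions with 1, 2 and 1 savepoints.
m = (1, 2, 1)
edges = {((0, 1), (1, 2)), ((1, 1), (2, 1))}

sets_ = consistent_sets(m, edges)

# Closure under componentwise max, as claimed in the proof of Lemma 3.
closed = all(tuple(map(max, s, r)) in sets_ for s in sets_ for r in sets_)

# The componentwise maximum over all consistent sets is itself consistent,
# so it is the unique set that rolls back least in every transaction.
best = tuple(max(s[i] for s in sets_) for i in range(len(m)))

print(sets_)          # [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)]
print(closed)         # True
print(best)           # (1, 1, 0)
```

On this instance the largest consistent set is (1, 1, 0), and every other consistent set rolls at least one transaction back further.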


Lemma 3 justifies our quest for the optimal consistent set of savepoints
(as opposed to some optimal such set).

Let $S = \{s_{1t_1}, \ldots, s_{nt_n}\}$, $S' = \{s_{1r_1}, \ldots, s_{nr_n}\}$ be sets of
savepoints, one per transaction. S and S' are not necessarily consistent. We
say that S transforms in one step to S', written $S \vdash S'$, iff there is an
edge $(s_{ij}, s_{i_0 j_0}) \in E$ such that $j \ge t_i$ and $j_0 \le t_{i_0}$, and
$r_{i_0} = t_{i_0} - 1$ and $r_i = t_i$ for all $i \ne i_0$. Note that, according to
this definition, Step 2 of Algorithm 1 repetitively transforms a set of
savepoints until it cannot transform it any more. We say that S transforms to
S' iff $S \vdash^{+} S'$, where $\vdash^{+}$ is the transitive closure of $\vdash$.


LEMMA 4: If $S_0 \vdash^{+} S$, $S_0 \vdash S'_1$, and $S$ is consistent, then
$S'_1 \vdash^{+} S$.

Proof (sketch): Recall that by the definition of $S \vdash S'$, the savepoints
of S and S' are identical except those of one transaction. Let $t(S, S')$
denote the index of that transaction, i.e. $t(S, S') = i_0$ where S, S' are as
in the definition of $\vdash$ given previously. Now let
$S_0 \vdash S_1 \vdash S_2 \vdash \cdots \vdash S_k = S$. We also have that
$S_0 \vdash S'_1$. We shall define inductively $S'_2, S'_3, \ldots, S'_k$ such that
$S'_i \vdash S'_{i+1}$ for $1 \le i < k$ and $S'_k = S_k = S$. This is done as
follows. Let $m$ be the minimum index such that
$t(S_{m-1}, S_m) = t(S_0, S'_1)$. Such an $m$ exists, for otherwise S would not be
consistent. Having defined $S'_1, \ldots, S'_j$ for some $j < m$ we define
$S'_{j+1}$ by choosing $t(S'_j, S'_{j+1}) = t(S_{j-1}, S_j)$. It is easy (but
tedious) to verify that this can be done legally, that $S'_j \vdash S'_{j+1}$
for $1 \le j < m$ and that $S'_m = S_m$. We may then define $S'_j = S_j$ for
$m < j \le k$, thus getting $S_0 \vdash S'_1 \vdash \cdots \vdash S'_k$, with
$S'_k = S_k = S$. Hence, $S'_1 \vdash^{+} S$, as wanted.


Lemma 4 implies that Algorithm 1, which as stated is non-deterministic,
will always terminate with the same consistent set of savepoints S*, no
matter in what order the edges of E are considered in Step 2.
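
This order independence can be observed empirically. The sketch below is an illustration, not a proof. It runs Step 2 many times on a hypothetical edge set, each time picking a random edge among those that currently satisfy the while condition, and collects the distinct final results. The edge encoding `((i, j), (i2, j2))`, the 0-based transaction indices and the function name are assumptions made for the example.

```python
import random


def run_step2(m, edges, rng):
    """Run Step 2 of Algorithm 1, choosing a random qualifying edge each time."""
    t = list(m)                                   # Step 1: t_i := m_i
    while True:
        ready = [e for e in edges
                 if e[0][1] >= t[e[0][0]] and e[1][1] <= t[e[1][0]]]
        if not ready:                             # no edge satisfies the condition
            return tuple(t)
        (_, _), (i2, _) = rng.choice(ready)
        t[i2] -= 1                                # one transformation step


# Hypothetical graph in which two edges qualify at the very first iteration.
m = [1, 2, 1]
edges = [((0, 1), (1, 2)), ((1, 1), (2, 1)), ((2, 0), (0, 1)), ((1, 2), (0, 1))]

results = {run_step2(m, edges, random.Random(seed)) for seed in range(200)}
print(results)  # {(0, 1, 0)}
```

Every run ends in the same set of savepoints, whichever qualifying edge is chosen at each iteration.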

We are now in a position to state and prove

THEOREM 1: Algorithm 1 computes the optimal consistent set of savepoints
with respect to any reasonable cost function.

Proof: Let $S^* = \{s_{1t_1}, \ldots, s_{nt_n}\}$ be the set of savepoints computed by
Algorithm 1, and let $S_m = \{s_{1r_1}, \ldots, s_{nr_n}\}$ be the set minimizing any
reasonable cost function. $S_m$ is well-defined by Lemma 3. By that same
lemma, $r_i \ge t_i$ for $1 \le i \le n$. Hence, if $S^* \ne S_m$ there is a non-empty
set $B = \{k : 1 \le k \le n$ and $t_k < r_k\}$. Consider some execution of
Algorithm 1 computing S* and let $i_0 \in B$ be the element of B such that at
some stage of the execution $t_{i_0}$ was set equal to $r_{i_0}$ in Step 2, while
for all $k \in B - \{i_0\}$, $t_k \ge r_k$. We claim that it is possible to select
edges satisfying the while condition of Step 2 so that $t_{i_0}$ is not
decremented before, for all $k \in B - \{i_0\}$, $t_k = r_k$. For, if this is not
the case then there exists some point in the execution of Algorithm 1 such
that $t_{i_0} = r_{i_0}$, $t_k \ge r_k$ for all $k \in B - \{i_0\}$, and the only edge(s)
satisfying the while condition of Step 2 is (are) of the form
$(s_{pq}, s_{i_0 q'})$ for some $p$, $q$, $q'$. Note that at this point
$t_i \ge r_i$ for all $1 \le i \le n$. Then we would have that (at that point of
the execution) $q \ge t_p$ and $q' \le t_{i_0}$. But we also have that
$t_{i_0} = r_{i_0}$ and $t_p \ge r_p$. Therefore, $q \ge r_p$ and $q' \le r_{i_0}$, and
hence $S_m$ is not consistent by Lemma 2 (since the edge $(s_{pq}, s_{i_0 q'})$
satisfies the while condition for $S = S_m$). This contradicts the choice of
$S_m$. Therefore, we may in fact choose edges as claimed above. But then
either the algorithm will terminate before $t_i = r_i$ for all
$i \in B - \{i_0\}$ (and then it would be the case that there exists some $k$
such that $t_k > r_k$, contradicting the choice of $S_m$ and Lemma 3) or, at
some stage, $t_i = r_i$ for all $i \in B$. By the definition of B, $t_i = r_i$ for
all $1 \le i \le n$. Hence $S^* = S_m$, as wanted. □


4.  SOME  REMARKS 

The interdependency of operations issued by different transactions
sometimes causes a transaction to restart due to a failure of some other
transaction. This "domino effect" is known as cascading restart, and is
usually considered to be undesirable. It is particularly serious if the
transactions that might have to be restarted are committed, as this defeats
the very purpose of commitment. The usual remedy is to adopt a sufficiently
conservative scheduling policy that avoids cascading restarts. If such a
strategy is used, the optimization problem treated in this paper is not an
issue.

For example, in System-R, where this savepoint feature has been
implemented, it is always possible to roll each transaction back to its last
savepoint: this is sufficient to guarantee consistency and is obviously
optimal. The reason this works is that System-R requires that all of a
transaction's locks be held until that transaction is committed. Therefore,
there are no dependencies between uncommitted transactions; the graph
G constructed by Algorithm 1 would contain no edges; and the initial choice
of savepoints $\{s_{1m_1}, \ldots, s_{nm_n}\}$ would be consistent.

The optimization problem treated in this paper becomes non-trivial in
environments where either cascading restarts are not an issue or methods
other than restrictive schedulers are employed to avoid them. An example
of such an alternative strategy would be to delay commitment. The benefit
of such other strategies is that less restrictive schedulers allow greater
parallelism and, therefore, presumably better response time and resource
utilization. We know of no study, however, which provides quantitative
evidence supporting either strategy. This remains a major research area.
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